FOREWORD


This document describes the third year of research and summarizes earlier
research on the Army's current, large-scale manpower and personnel
effort for improving the selection, classification, and utilization of Army
enlisted personnel. The thrust for the project came from the practical,
professional, and legal need to validate the Armed Services Vocational
Aptitude Battery (ASVAB--the current U.S. military selection/classification
test battery) and other selection variables as predictors of training and
performance.

Project A is being conducted under contract to the Selection and
Classification Technical Area (SCTA) of the Manpower and Personnel Research
Laboratory (MPRL) at the U.S. Army Research Institute for the Behavioral and
Social Sciences. The portion of the effort described herein is devoted to
the development and validation of Army Selection and Classification Measures,
and is referred to as "Project A." This research supports the MPRL and SCTA
mission to improve the Army's capability to select and classify its
applicants for enlistment or reenlistment by ensuring that fair and valid
measures are developed for evaluating applicant potential based on expected
job performance and utility to the Army.

Project A was authorized through a Letter, DCSOPS, "Army Research
Project to Validate the Predictive Value of the Armed Services Vocational
Aptitude Battery," effective 19 November 1980; and a Memorandum, Assistant
Secretary of Defense (MRA&L), "Enlistment Standards," effective 11 September
1980.

In order to ensure that Project A research achieves its full scientific
potential and will be maximally useful to the Army, a governance
advisory group comprised of Army General Officers, Interservice Scientists,
and experts in personnel measurement, selection, and classification was
established. Members of the latter component provide guidance on technical
aspects of the research, while general officer and interservice components
oversee the entire research effort, provide military judgment, provide
periodic reviews of research progress, results, and plans, and coordinate
within their commands. Members of the General Officers' Advisory Group
include MG Porter (DMPM) (Chair), MG Briggs (FORSCOM, DCSPER), MG Knudson
(DCSOPS), BG Franks (USAREUR, ADCSOPS), and MG Edmonds (TRADOC, DCS-T). The
General Officers' Advisory Group was briefed in May 1985 on the issue of
obtaining proponent concurrence of the criterion measures prior to
administration in the concurrent validation. Members of Project A's Scientific
Advisory Group (SAG), who guide the technical quality of the research,
include Drs. Milton Hakel (Chair), Philip Bobko, Thomas Cook, Lloyd
Humphreys, Robert Linn, Mary Tenopyr, and Jay Uhlaner. The SAG was briefed
in October 1984 on the results of the Batch A field test administration.
Further, the SAG was briefed in March 1985 on the contents of the proposed
Trial Battery.




A comprehensive set of new selection/classification tests and job
performance/training criteria has been developed and field tested. Results
from the Project A field tests and subsequent concurrent validation will be
used to link enlistment standards to required job performance standards and
to assign soldiers to Army jobs more accurately.
EDITOR'S PREFACE


This Project A Annual Report for Fiscal Year 1985 has a different form
than the reports for previous years. It is intended to be a comprehensive
and reasonably detailed summary of the first 3 years of the Army Selection
and Classification Project (Project A). The first 3 years are noteworthy
because they encompass all the development work on the broad array of
selection/classification tests and performance criteria upon which the
concurrent and longitudinal validations will be based. Consequently, this
report is meant to be an account of instrument development, from the
conceptualization of the domains to be assessed to a description of the final
revisions of the measures themselves.

Three years may seem like a long time to spend on development work, but
we hope that by the end of the report the reader will be convinced that
Project A is of a different order of magnitude than most personnel research
projects and that 3 years was a bare minimum. Future annual reports and
contract deliverables will report on the validation results, estimates of
classification efficiency, and results bearing on general issues in ability
measurement and job performance assessment.

The bulk of this report consists of edited and abridged material from
a series of field test reports produced by the project's individual
research teams. Consequently, while the current volume includes considerable
detail, an even more detailed account can be found in the individual field
test reports, which are supplemented by extensive appendixes issued
separately.

In  general,  the  primary  sources  for  the  major  sections  were  as  follows: 

•  Part I is largely based on the Annual Reports for FY83 and FY84:

Improving the Selection, Classification, and Utilization of
Army Enlisted Personnel: Annual Report, by Human Resources
Research Organization, American Institutes for Research,
Personnel Decisions Research Institute, and Army Research
Institute, ARI Research Report 1347, 1983. (AD A141 807)

Improving the Selection, Classification, and Utilization of
Army Enlisted Personnel: Technical Appendix to the Annual
Report, Newell K. Eaton and Marvin H. Goer (Eds.), ARI
Research Note 83-37, 1983. (AD A137 117)

Improving the Selection, Classification, and Utilization of
Army Enlisted Personnel: Annual Report Synopsis, 1984
Fiscal Year, by Human Resources Research Organization,
American Institutes for Research, Personnel Decisions
Research Institute, and Army Research Institute, ARI
Research Report 1393, 1984. (AD A173 824)

Improving the Selection, Classification, and Utilization of
Army Enlisted Personnel: Annual Report, 1984 Fiscal Year,
Newell K. Eaton, Marvin H. Goer, James H. Harris, and Lola
M. Zook (Eds.), ARI Technical Report 660, 1984. (AD A178 944)

Improving the Selection, Classification, and Utilization of
Army Enlisted Personnel: Appendices to Annual Report, 1984
Fiscal Year, by Human Resources Research Organization,
American Institutes for Research, Personnel Decisions
Research Institute, and Army Research Institute, ARI
Research Note 85-14, 1984.


•  Part II is composed of edited and abridged material from the
following predictor field test reports:

Development and Field Test of the Trial Battery for Project
A, Norman Peterson (Ed.), ARI Technical Report 739, 1987.
Authors of individual chapters include Norman Peterson, Jody
Toquam, Leaetta Hough, Janis Houston, Rodney Rosse, Jeffrey
McHenry, Teresa Russell, VyVy Corpe, Matthew McGue, Bruce
Barge, Marvin Dunnette, John Kamp, and Mary Ann Hanson. In
preparation.

Development and Field Test of the Trial Battery for Project
A: Appendixes to ARI Technical Report 739, Norman Peterson
(Ed.), ARI Research Note in preparation.

•  Part III is primarily drawn from the series of field test reports
dealing with criterion development:

Development and Field Test of Job-Relevant Knowledge Tests
for Selected MOS, by Robert H. Davis, Gregory A. Davis, John
N. Joyner, and Maria Veronica de Vera, ARI Technical Report
757, 1987. In preparation.

Development and Field Test of Job-Relevant Knowledge Tests
for Selected MOS: Appendixes to ARI Technical Report 757,
by Robert H. Davis, Gregory A. Davis, John N. Joyner, and
Maria Veronica de Vera, ARI Research Note in preparation.

Development and Field Test of Army-Wide Rating Scales and
the Rater Orientation and Training Program, Elaine D.
Pulakos and Walter C. Borman (Eds.), ARI Technical Report
716, 1986. Authors of individual chapters include Walter C.
Borman, Sharon R. Rose, and Elaine D. Pulakos. (AD B112 857)

Development and Field Test of Army-Wide Rating Scales and
the Rater Orientation and Training Program: Appendixes to
ARI Technical Report 716, Elaine D. Pulakos and Walter C.
Borman (Eds.), ARI Research Note 87-22, 1987. In
preparation.


Development and Field Test of Task-Based MOS-Specific
Criterion Measures, by Charlotte H. Campbell, Roy C. Campbell,
Michael G. Rumsey, and Dorothy C. Edwards, ARI Technical
Report 717, 1986. In preparation.

Development and Field Test of Task-Based MOS-Specific
Criterion Measures: Appendixes to ARI Technical Report 717, by
Charlotte H. Campbell, Roy C. Campbell, Michael G. Rumsey,
and Dorothy C. Edwards, ARI Research Note in preparation.


Development and Field Test of Behaviorally Anchored Rating
Scales for Nine MOS, by Jody L. Toquam, Jeffrey J. McHenry,
VyVy A. Corpe, Sharon R. Rose, Steven E. Lammlein, Edward
Kemery, Walter C. Borman, Raymond Mendel, and Michael J.
Bosshardt, ARI Technical Report in preparation.

Development and Field Test of Behaviorally Anchored Rating
Scales for Nine MOS: Appendixes to ARI Technical Report, by
Jody L. Toquam, Jeffrey J. McHenry, VyVy A. Corpe, Sharon R.
Rose, Steven E. Lammlein, Edward Kemery, Walter C. Borman,
Raymond Mendel, and Michael J. Bosshardt, ARI Research Note
in preparation.

The Development of Administrative Measures as Indicators of
Soldier Effectiveness, by Barry J. Riegelhaupt, Carolyn
DeMeyer Harris, and Robert Sadacca, ARI Technical Report
754, 1987. In preparation.

The introduction to Part III is a creation of the editor. The
description of the criterion field test procedures and the
general summary of criterion field test results are edited
versions of material from a 1985 paper:

Criterion Reduction and Combination via a Participative
Decision-Making Panel, by John P. Campbell and James H.
Harris, paper presented at the convention of the American
Psychological Association, Los Angeles, 1985.

The descriptions of the development and testing of the combat
performance prediction scales are based primarily on material
supplied by Barry J. Riegelhaupt and Robert Sadacca.

•  Part IV was assembled by the editor with assistance from James
Harris and Laurie Wise. Much of the material comes from briefing
materials developed by Harris, Wise, and the editor.

Various technical papers prepared during FY85 on specialized aspects
of the Project A research are made available in a supplement to this
report: Improving the Selection, Classification, and Utilization of Army
Enlisted Personnel: Annual Report, 1985 Fiscal Year--Supplement to ARI
Technical Report 746, ARI Research Note in preparation. These papers are
listed in Appendix A of the present report.

Additional editorial assistance was provided by Lola M. Zook. Barbara
Hamilton cut, spliced, typed, and retyped many versions of the original
manuscript.


John  P.  Campbell 
IMPROVING THE SELECTION, CLASSIFICATION, AND UTILIZATION OF ARMY ENLISTED
PERSONNEL:  ANNUAL  REPORT,  1985  FISCAL  YEAR 


EXECUTIVE  SUMMARY 


Requirement:

Project A is a comprehensive, long-range U.S. Army program to develop
an improved personnel selection and classification system for enlisted
personnel. The system encompasses 675,000 persons and several hundred military
occupational specialties (MOS). The objectives are to (a) validate existing
selection measures against both existing and project-developed criteria, and
to develop new measures; and (b) validate early criteria (e.g., performance
in training) as predictors of later criteria (e.g., job performance
ratings), to improve reassignment and promotion decisions.

Procedure: 

Under the sponsorship of the U.S. Army Research Institute (ARI), work
on the 9-year project was begun in 1982. The research involves an
iterative progression of development, testing, evaluation, and further
development of selection/classification instruments (predictors) and
measures of job performance (criteria).

In the first stage, file data from FY81/82 Army accessions were used
to explore the relationships between the scores applicants made on the
Armed Services Vocational Aptitude Battery (ASVAB) and their later
performance in training and first-tour skill tests. The second stage is being
executed with FY83/84 accessions; the 19 MOS in the sample were selected as
representative of the Army's 250+ entry-level MOS and account for 45% of
Army accessions. A preliminary battery of perceptual, spatial,
temperament, interest, and biodata predictor measures was tested on several
thousand soldiers as they entered four MOS; subsequent versions were pilot
tested and field tested with nine MOS. The resulting predictor battery,
along with a comprehensive set of job knowledge tests, hands-on job
samples, and performance ratings, is being administered to 19 MOS. In the
third stage, all of the measures, refined from experience, will be used to
test about 50,000 soldiers across 19 MOS in the FY86/87 predictor battery
administration and subsequent measurement of first-tour performance. About
3,500 are expected to be available for second-tour performance measurement
in FY91.


Findings: 

The wide variety of predictor and criterion measures under development
was extensively field tested during FY84 and the first half of FY85, the
third year of effort in the project. These tests resulted in the Trial
Battery being used in the "Concurrent Validation" phase begun in FY85.


Utilization  of  Findings: 

The full array of selection/classification measures of job and training
performance from Project A is being utilized in current and long-range
research programs expected to make the Army more effective in matching
first-tour enlisted manpower requirements with available personnel resources.
PART I


OVERVIEW  OF  PROJECT  A  AND 
SUMMARIES  OF  FISCAL  YEAR  1983  AND 
FISCAL YEAR 1984 ACTIVITIES


Part I describes the origins and objectives of
Project A, the project's organizational structure, and
the overall design of the research. The central
activities and accomplishments of the first 2 years
are then summarized, along with the plan for
integrating these materials with the description of
the project's third year in the remainder of this
report.


Section  1 


ORIGINS AND FORMULATION OF PROJECT A¹


"Project A" (Improving the Selection, Classification, and Utilization
of Army Enlisted Personnel) is perhaps the largest personnel research and
development project ever undertaken. Its general purpose is to develop an
improved selection/classification system for all entry-level positions in
an organization that annually recruits 400,000-500,000 people, selects
100,000-120,000 of them, and assigns each individual to 1 of more than 250
job classifications. The full design for Project A covers a span of 9
years, and we have now completed the third year.

The project is so large that it could not be executed by one research
organization or university group. Consequently, the "contractor" is
actually a consortium of three research firms: Human Resources Research
Organization (HumRRO) of Alexandria, Virginia; Personnel Decisions Research
Institute (PDRI) of Minneapolis, Minnesota; and American Institutes for
Research (AIR) of Washington, D.C. The contract is administered and
monitored by the U.S. Army Research Institute for the Behavioral and Social
Sciences (ARI), which also contributes a sizable proportion of scientific
and technical resources to the project.

A parallel effort to Project A is Project B (Development of a
Computerized Personnel Allocation System). Project B is responsible for
modeling the labor supply and labor demand components of a fully
functioning personnel allocation system, and for developing the computer
algorithms and software to integrate information on supply, demand, and
classification validity.

If  both  Project  A  and  Project  B  are  successful,  the  final  product  will 
be  composed  of  the  following  elements: 

•  A labor supply forecasting model and procedures for estimating the
parameter values of the model.

•  A model for forecasting the Army's long- and near-term personnel
needs (labor demand) and procedures for estimating the parameter
values of the model.

•  A new set of selection/classification tests which, together with
the Armed Services Vocational Aptitude Battery (ASVAB), optimize
the balance between the costs of testing and the gain in
classification utility.

•  A  metric  and  procedure  for  estimating  the  utility  of  performance 
within  and  across  jobs. 


¹Much of the material in Section 1 is drawn from the Project A annual
report for the 1983 fiscal year (ARI Research Report 1347) and the 1984
fiscal year (ARI Research Report 1393) and associated documents.


•  A set of computerized algorithms (e.g., linear programming) that
integrates demand information, supply information, and validity
information in such a way that, for any designated period, the
overall utility of personnel assignments is maximized.
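
The last element above, choosing assignments so that overall utility is maximized, can be illustrated with a very small sketch. It is not the Project B algorithm: the report mentions linear programming only as an example, and the code below swaps in an exhaustive search over a 3-by-3 table so that it stays self-contained. The function name and the utility values are hypothetical and chosen only for illustration.

```python
from itertools import permutations


def best_assignment(utility):
    """Return (total, assignment) with the highest summed utility.

    utility[i][j] is a hypothetical utility of placing person i in job j.
    assignment[i] is the job index given to person i. Exhaustive search is
    only practical for toy examples; it stands in here for an optimizer.
    """
    n = len(utility)
    best_total, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(utility[i][perm[i]] for i in range(n))
        if total > best_total:
            best_total, best_perm = total, perm
    return best_total, list(best_perm)


# Made-up utilities for three people (rows) and three jobs (columns).
utility = [
    [9, 2, 7],
    [6, 4, 3],
    [5, 8, 1],
]
total, assignment = best_assignment(utility)
```

In this toy table the first person's individually best job is job 0, yet the assignment with the highest total places that person in job 2. That gap between individual best fit and overall utility is the reason an integrating algorithm is listed as a separate product.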

All of this is an ambitious undertaking. The following report is a
summary of the first 3 years of Project A's contributions to the effort as
well as a detailed report of activities during the third year.


The Selection/Classification System for Army Enlisted Personnel


The  Current  System 

Each year more than 100,000 new recruits are selected, classified,
trained, and assigned to perform the hundreds of jobs required for an
effective Army. The system currently used for making the initial selection
and classification decision has a long history. The development of the
primary selection measure, ASVAB 8/9/10, can be traced through earlier
forms--the Army Classification Battery (ACB), the Army Qualification
Battery (AQB), the Armed Forces Qualification Test (AFQT), the Army General
Classification Test (AGCT)--back to the original Army Alpha.

To be qualified for initial enlistment into the Army by the present
system, applicants must meet a number of eligibility criteria, including
age, moral standards, physical standards, and "trainability." The latter
determination, the most relevant in the current context, is based upon a
combination of two sets of criteria: scores attained on the ASVAB, and
educational attainment.

The ASVAB is currently administered as an entry test at Military
Entrance Processing Stations (MEPS) or at Mobile Examining Team (MET)
sites. It is also administered by MET to high school juniors and seniors.
Scores from this test are used for guidance counseling and are also
provided to Army recruiters as a means of identifying qualified recruitment
prospects. In addition to ASVAB, non-high school graduates are
administered a short biographical questionnaire, the Military Applicant Profile
(MAP), which has been found to be a useful tool for identifying the
individuals who are likely to be poor risks in terms of probability of
completing Army initial entry training.

For applicants who have not previously taken the ASVAB and whose
educational/mental qualifications appear to be marginal in terms of the
Army's trainability standards, a short enlistment screening test may be
administered to assess an applicant's prospects of passing the ASVAB test.
Applicants who appear, upon initial recruiter screening, to have a
reasonable prospect of qualifying for service are referred either to a MET site
for administration of the ASVAB, or directly to a MEPS. MEPS staff
complete all aspects of the screening process, including administration of the
mental and physical examination. On the basis of the information
assembled, those found qualified for enlistment are classified by Military
Occupational Specialty (MOS) and assigned to a particular training activity.


About 80% of Army enlistees enter the Army under a specific enlistment
option that guarantees choices of initial school training, career field
assignment, unit assignment, or geographical area. For these applicants,
the initial classification and training assignment decision must be made
prior to the entry into service. This is accomplished at the MEPS by
referring applicants who have passed the basic screening criteria (mental,
physical, moral) to an Army guidance counselor, whose responsibility it is
to match the applicant's qualifications and preferences to the Army's
current skill training requirements, and to "make reservations" for
training assignments, consistent with the applicant's enlistment option.

The classification and training reservation procedure is accomplished
by the Recruit Quota System (REQUEST), which was implemented in 1973.
REQUEST is a computer-based system that coordinates the information needed
to reserve training slots for volunteers. One major limitation is that
REQUEST uses minimum qualifications for accessions control. Thus, to the
extent that an applicant may minimally qualify for a wide range of courses
or specialties, based on aptitude test scores, the initial classification
decision is governed by (a) his or her own stated preference (often based
upon limited knowledge about the actual job content and working conditions
of the various military occupations), (b) the availability of training
slots, and (c) priorities/needs of the Army. Numerous procedures for
improving the system are under development. These include "MOS Match
Module" and the previously mentioned Project B Computerized Personnel
Allocation System, as well as other smaller efforts.
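
The limitation described above can be made concrete with a short sketch of a minimum-qualification rule. This is not the REQUEST logic, which the report does not specify; the MOS labels, composite names, cutoffs, and slot counts are all hypothetical. Only factors (a) and (b) are modeled, and Army priorities (c) are left out.

```python
def eligible_mos(scores, cutoffs):
    """Return the MOS whose minimum composite cutoff the applicant meets.

    scores maps a composite name to the applicant's score; cutoffs maps an
    MOS label to a (composite name, minimum score) pair. All hypothetical.
    """
    return sorted(
        mos
        for mos, (composite, minimum) in cutoffs.items()
        if scores.get(composite, 0) >= minimum
    )


def reserve(scores, cutoffs, preferences, open_slots):
    """Reserve the first preferred MOS that is minimally qualified and open.

    How far the applicant exceeds a cutoff plays no role, which is the
    limitation of minimum-qualification control noted in the text.
    """
    qualified = set(eligible_mos(scores, cutoffs))
    for mos in preferences:
        if mos in qualified and open_slots.get(mos, 0) > 0:
            return mos
    return None


# Made-up applicant, cutoffs, preferences, and training-slot availability.
scores = {"C1": 110, "C2": 95}
cutoffs = {"MOS_A": ("C1", 100), "MOS_B": ("C2", 90), "MOS_C": ("C2", 105)}
eligible = eligible_mos(scores, cutoffs)
choice = reserve(
    scores,
    cutoffs,
    preferences=["MOS_C", "MOS_B", "MOS_A"],
    open_slots={"MOS_A": 3, "MOS_B": 0, "MOS_C": 2},
)
```

Under this rule an applicant who barely clears a cutoff and one who far exceeds it are treated identically, so nothing in the decision reflects where either would perform best.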

This review of current practice suggests that the present selection
and classification procedures could be improved by taking advantage of
recent technological advances and developments in decision theory. There
is a need for developing a formal decision-making procedure that is aimed
at maximizing the overall utility of the classification outcomes to the
Army. However, this decision process must allow for the potentially
adverse impacts on recruitment if the enlistee's interests, work values,
and preferences are not given sufficient consideration. There are clear
trade-offs that must be evaluated between the procedures necessary to (a)
attract qualified people, and (b) put them into the right slots.


Modifications Needed in the Current System

The current Army personnel system has a number of features that must
be addressed in Project A:

1.  Current selection measures cover a fairly limited range of
individual characteristics. The ASVAB is an excellent measure of
general cognitive abilities. However, in addition, there is a
need for developing potentially relevant non-cognitive measures,
such as psychomotor/perceptual abilities, vocational interests,
and biographical indexes, and determining their usefulness in
predicting aspects of Army-wide and MOS-specific performance.

2.  No measures of job performance that can be used as criterion
measures in validation research are available. Current measures
of job proficiency (Skill Qualification Tests--SQT) are designed
primarily as diagnostic training tools rather than as standardized
procedures for performance appraisal.

3.  The available information on selection and classification validity
is based on the relationship of entrance tests to performance in
training, not performance on the job.

4.  The Army does not have the data system necessary to make critical
personnel decisions throughout a soldier's lifecycle on the basis
of accumulating information about the job performance of the
individual and the needs and priorities of the Army.

5.  Currently, if an applicant chooses a specific training program and
meets the minimum aptitude requirements, he or she is placed into
that training if an opening exists. This procedure does not take
into account where that individual could best serve the needs of
the Army or even where that individual could be most successful in
the Army.

6.  The Army does not have an efficient means of expressing needs and
policies in terms of personnel goals, constraints, and trade-offs.
An adaptive, self-adjusting system that can more fully support
management decision making is needed.

These characteristics of the current system stem primarily from the
dynamics in the labor market, the new requirements produced by emerging
weapon systems, and the inevitable lag of an operational system behind the
most recent technological advances in testing and personnel decision making.

Origins  of  the  Project 

In response to needs expressed by the Army and by Congress, as well as
the previously mentioned professional considerations, ARI began in 1980 to
develop a major new research and development (R&D) program in personnel
selection, classification, and allocation. The basic requirement was to
demonstrate the validity of the ASVAB as a predictor of both training and
on-the-job performance.

While ARI staff were systematically reviewing that requirement, the
concept of a larger project began to emerge. With only a moderate amount of
additional resources, new predictors in the perceptual, psychomotor,
interest, temperament, and biodata domains could be evaluated as well. A
longitudinal research data base could be developed to accumulate information on a
variety of predictor/criterion relationships from enlistment, through
training, first-tour assignments, reenlistment decisions, and for some, to their
second tour. Also, the data could be the basis for making near-real-time
decisions on the best match between characteristics of an individual enlistee
or reenlistee and the requirements of available Army Military Occupational
Specialties (MOS).

To address the selection and classification portion of the effort,
solicitation MDA 903-8.1-12-R-0158, "Project A: Development and Validation of
Army Selection and Classification Measures," was issued 21 October 1981.
This document is the "official" starting point of Project A. The
solicitation initially outlined a 7-year program designed to provide the information
necessary for implementing a state-of-the-art selection and classification
system for all U.S. Army enlisted personnel.

While the contract Statement of Work (SOW) and the Request for Proposals
(RFP) were being developed, certain structural changes were made within ARI
to accommodate the project. A new manpower and personnel laboratory was
created with Joyce L. Shields as director, and a selection and classification
technical area was established, headed by Newell K. Eaton. To execute the
in-house research and to monitor the contract, it was also necessary to recruit
additional professional staff.

In response to the RFP, the Human Resources Research Organization,
American Institutes for Research, and Personnel Decisions Research Institute
formed a consortium to develop a research proposal for Project A. HumRRO
assumed the responsibility as the prime contractor, and the consortium's
proposal was submitted in January 1982. The contract was awarded to the
HumRRO-AIR-PDRI consortium 30 September 1982.


Specific Objectives of Project A


The project has two principal kinds of objectives. The first type
pertains to the operational needs of the Army. They constitute the basic
purposes for which the project is funded and supported. Specifically, Project A
is to:


1.  Develop new measures of job performance that can be used as
criteria against which to validate selection/classification measures.
The new criterion measures will use a variety of methods to assess
both job-specific measures of task performance and general
performance factors that are not job specific.

2.  Validate existing selection measures against both existing and
project-developed criteria.

3.  Develop and validate new and/or improved selection and
classification measures.

4.  Validate proximal criteria, such as performance in training, as
predictors of later criteria, such as job performance ratings, so
that more informed decisions about reassignment and promotion can
be made throughout the individual's tour.

5.  Determine the relative utility to the Army of different performance
levels across MOS.

6.  Estimate the relative effectiveness of alternative selection and
classification procedures in terms of their validity and utility
for making operational selection and classification decisions.

A second set of objectives has to do with questions of a more scientific
nature. This second set of questions is being addressed with essentially the
same data as the first. That is, the project does not have two parts, with
one having to do with basic research and the other focused on applied
research. Instead, the scope of the project and the attempt to consider an
entire system at one time make it possible to concurrently address a number
of more basic research objectives. Some of these are as follows:

1. Identify the basic variables (constructs) that constitute the
universe of information available for selection/classification into
entry-level-skilled jobs.

2. Develop a comprehensive model of performance for entry-level-skilled
jobs that incorporates both a theoretical latent structure
and linkages to state-of-the-art measurement.

3. Describe the utility functions and the utility metrics that
individuals actually use when estimating "utility of performance."

4. Describe the degree of differential prediction across (a) major
domains of abilities, personality, interests, and personal history,
(b) major factors of job performance, and (c) different types of
jobs. The project will collect a large sample of information from
each of these three populations (i.e., individual differences,
performance factors, and jobs).

5.  Determine  the  extent  of  differential  prediction  across  racial  and 
gender  groups  for  a  systematic  sample  of  individual  differences, 
performance  factors,  and  jobs. 

6.  Develop  new  statistical  estimators  of  classification  efficiency. 

Each of the above objectives, both applied and basic, breaks down into a
number of more specific questions that will be touched on in later sections.
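
As an illustration of what a differential prediction analysis (basic objectives 4 and 5) involves, the following minimal Python sketch fits a separate least-squares line for each of two groups and compares the slopes and intercepts. The scores, the group labels, and the function name fit_line are hypothetical; they are not taken from Project A data, and the report does not specify the estimator the project used.

```python
from statistics import mean

def fit_line(x, y):
    """Ordinary least-squares slope and intercept for a single predictor."""
    mx, my = mean(x), mean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx

# Hypothetical predictor scores (x) and criterion scores (y) for two groups.
group_a = ([40, 50, 60, 70], [2.0, 2.5, 3.0, 3.5])
group_b = ([40, 50, 60, 70], [1.8, 2.4, 3.0, 3.6])

slope_a, intercept_a = fit_line(*group_a)
slope_b, intercept_b = fit_line(*group_b)

# Nonzero gaps indicate that one common regression line would not
# predict equally well for both groups.
slope_gap = slope_b - slope_a
intercept_gap = intercept_b - intercept_a
print(f"slope gap = {slope_gap:.3f}, intercept gap = {intercept_gap:.3f}")
```

In practice the gaps would be tested against their sampling error, which is one reason large within-group samples are needed.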


Project  A  Organization 

Task  Structure 

For purposes of an orderly division of labor, Project A is organized
into five major research tasks:

Task 1. Data Base Management and Data Analysis. Task 1 has two major
components. The first component deals with designing, generating, and
maintaining the data base. By the end of the project the data base will
contain several hundred thousand records taken from three Army troop cohorts,
three major validation samples, and numerous pilot samples. The second
component is concerned with providing the analytic capability for (a)
analyzing field test and validation data and (b) evaluating the existing set
of predictors against the new performance measures, to determine whether the
new predictors have incremental validity over and above the present system.
These two components must be accomplished using state-of-the-art technology
in methods for analyzing personnel selection research data.

Task 2. Development of Predictors of Job Performance. To date, a large
proportion of the efforts of the Armed Services in this area has been
concentrated on improving the ASVAB, which is now a well-researched, valid
measure of general cognitive abilities. However, many critical Army tasks
depend on psychomotor and perceptual skills for their successful performance.
Further, neither biodata nor motivational variables are now comprehensively
evaluated. It is perhaps in these four non-cognitive domains that the
greatest potential for adding valid independent dimensions to current
classification instruments is to be found. The objectives of Task 2 are to
develop a broad array of new and improved selection measures and to
administer them to three major validation samples. A critical aspect of this
task is the demonstration of the incremental validity added by new
predictors.

Task 3. Measurement of School/Training Success. The objective of
Task 3 is to derive school and training performance indexes that can be
used (a) as criteria against which to validate the initial predictors, and
(b) as predictors of later job performance. Comprehensive job knowledge
tests will be developed for the sample of MOS investigated, and their
content and construct validity will be determined. Two additional purposes
for developing training criterion measures are to determine the relationship
of training performance to job performance and to find out whether
validating against each of these kinds of criteria selects the same or
different predictor measures.

Task 4. Assessment of Army-Wide Performance. In contrast to
performance measures that may be developed for a specific Army MOS, Task 4
will develop measures that can be used across all MOS (i.e., Army-wide).
The intent is to develop measures of first- and second-tour job performance
against which all Army enlisted personnel may be assessed. A major objective
for Task 4 is to develop a model of soldier effectiveness that specifies
the major dimensions of an individual's contribution to the Army as an
organization. Another important objective of Task 4 is to develop measures
of performance utility.

Task 5. Development of MOS-Specific Performance Measures. The focus
of Task 5 is the development of reliable and valid measures of specific job
task performance for a selected set of MOS. This task may be thought of as
having three major components: job analysis, construction of job performance
measures, and validation of the constructs for the new measures.
While only a subset of MOS will be analyzed during this project, the Army
may in the future wish to develop job performance measures for a larger
number of MOS. For this reason, it is intended that the methods used will
apply to all Army MOS.

In  addition,  Task  6  deals  with  administrative  management  of  the 
project. 
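
The incremental validity mentioned under Tasks 1 and 2 can be illustrated with a small worked example. The sketch below uses the standard two-predictor formula for the squared multiple correlation and reports how much a new test adds beyond an existing one. The function name incremental_r2 and all correlation values are hypothetical; the report does not state which estimator the project used.

```python
def incremental_r2(r_y1, r_y2, r_12):
    """Gain in squared multiple correlation from adding predictor 2 to predictor 1.

    r_y1, r_y2: correlation of each predictor with the criterion.
    r_12: correlation between the two predictors (must not be +1 or -1).
    """
    r2_full = (r_y1**2 + r_y2**2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12**2)
    return r2_full - r_y1**2

# Hypothetical values: existing battery r = .50, new test r = .30.
# If the new test is uncorrelated with the existing battery, all of its
# validity is incremental.
gain_if_independent = incremental_r2(0.50, 0.30, 0.0)

# If the new test overlaps heavily with the existing battery (r_12 = .60),
# the same zero-order validity adds nothing.
gain_if_redundant = incremental_r2(0.50, 0.30, 0.6)

print(f"independent: {gain_if_independent:.3f}, redundant: {gain_if_redundant:.3f}")
```

The contrast between the two cases is the reason new predictors are evaluated against the existing battery rather than in isolation.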

The Consortium/ARI Organization


The initial project organization is depicted in Figure 1.1. The
principal consortium investigators are shown, with their respective
organizations, in the lower row. The principal ARI staff are shown in the
upper row. Within the project, consortium and ARI investigators undertake
both independent and joint research activities. ARI staff also have the
administrative role of contract oversight.

Figure 1.1. Project A organization as of 30 September 1983.


During the first 3 years, technical and management oversight has been
the responsibility of Newell K. Eaton, the contracting officer's technical
representative (COR). He has been the ARI principal scientist, with the
responsibility for technical review and guidance. Consortium management has
been the responsibility of Marvin H. Goer, the managing project director.
Within the consortium, John P. Campbell has been the principal scientist
responsible for overall scientific quality. Robert Sadacca has been the
assistant for technical planning and research design. James Harris has been
primarily responsible for the day-to-day coordination of the project's
multiple activities.

The  Advisory  Group  Structure 

To ensure that Project A is consistent with other ongoing research
programs being conducted by the other armed services, a mechanism was needed
for maintaining close coordination with the other military departments, as
well as with the Department of Defense. A procedure also was needed to
assure that the research program is technically sound, both conceptually and
methodologically. Finally, a method was needed to receive feedback on
priorities and objectives, as well as to identify current problems before
they become too large to fix.

The method used to meet these needs was to establish a series of
advisory groups. Figure 1.2 shows the structure and membership of the
Governance Advisory Group, which comprises the Scientific Advisory
Group (SAG), Army Advisory Group (AAG), and Interservice Advisory Group
(ISAG) components.

Figure 1.2. Governance Advisory Group as of 30 September 1983.

The SAG comprises nationally recognized authorities in psychometrics,
experimental design, sampling theory, utility analysis, applied research
in selection and classification, and the conduct of psychological research
in the Army environment. The ISAG comprises the Laboratory
Directors for applied psychological research in the Army, Air Force, and
Navy, and the Director of Accession Policy from the DoD Office of Assistant
Secretary of Defense for Manpower and Reserve Affairs.

The AAG includes representatives from the Office of Deputy Chief of
Staff for Personnel (DCSPER), Office of Deputy Chief of Staff for Operations
(DCSOPS), Training and Doctrine Command (TRADOC), Forces Command (FORSCOM),
and U.S. Army Europe (USAREUR). These senior officers have a significant
interest in the project planning and priorities. They also represent the
elements that provide the necessary and substantial troop support.


The  Research  Plan  and  Integrated  Master  Plan 

The first 6 months of the project were spent planning, documenting,
reviewing, modifying, and redrafting research plans, troop support requests,
administrative support plans, and budgetary plans, as well as executing
initial research efforts. Drafts of the plans were provided to the SAG and
ISAG. Their comments, provided orally during meetings and subsequently
written in response to draft documents, were incorporated in the research
plan.

The culminating review was conducted in April 1983 by the Army Advisory
Group, with representatives from the Scientific and Interservice Advisory
Groups. In that meeting the advisors reviewed the entire research program,
research design, sampling strategy, main cohort and focal MOS recommendation,
and troop support implications. They incorporated changes to reduce the
troop support burden and distribute it more equitably among the three
participating commands (FORSCOM, TRADOC, USAREUR). All three components of the
Governance Advisory Group endorsed the research program.

In May 1983, ARI issued ARI Research Report 1332, Improving the
Selection, Classification, and Utilization of Army Enlisted Personnel --
Project A: Research Plan. In June 1983, the Project A: Integrated Master
Plan (HumRRO FR-PRD-83-8) was issued, providing detailed budget allocations,
schedules, and specifications of contract deliverables.


Summary  of  Research  Design  and  Sample  Selection 

The  overall  design  of  Project  A  is  described  in  detail  in  the  Master 
Research  Plan  (June  1983).  Again,  the  overall  objectives  are  to  develop  and 
validate  an  experimental  battery  of  new  and  improved  selection  measures 
against  a  comprehensive  array  of  job  performance  and  training  criteria.  The 
validation  research  must  produce  sample  estimates  of  the  parameters  necessary 
to  implement  a  computerized  selection  and  classification  system  for  all 
first-tour  enlisted  MOS. 

Research Design


To meet these objectives, a design was developed that uses two predictive
and one concurrent validation on two major troop cohorts (FY83/84
accessions and FY86/87 accessions), and one file data validation on the
FY81/82 cohort. That is, in addition to collecting data from new samples,
the project is making use of existing file data that have been, or can be,
accumulated for 1981 and 1982 accessions. A schematic of the data collection
plan is shown in Figure 1.3.

The logic of the design is straightforward. Existing file data on the
FY81/82 cohort provided the first opportunity to revalidate the ASVAB against
existing training criteria and against the SQT. As described in a separate
report (McLaughlin, Rossmeissl, Wise, Brandt, & Wang, 1984), and summarized
later in this report, the results of the analyses of FY81/82 file data were
used to suggest operational changes in ASVAB composites. The file sample
consisted of approximately 90,000 records distributed over 120 MOS in
sufficient numbers to permit analysis. The FY81/82 data also provide a
benchmark against which to compare the additional validation data to be
collected.

The FY83/84 cohort provided the first opportunity to obtain validation
data using new predictor tests and new performance measures. Two samples
have been taken from this cohort. First, a "preliminary" predictor battery
of predominantly off-the-shelf tests chosen to represent major constructs was
administered to soldiers in four MOS (31C, 19E/K, 63B, 71L) as they entered
the Army during the last half of FY83 and the first half of FY84. A total of
11,000 personnel in the four MOS were tested. Besides looking at the
relationship of the Preliminary Battery constructs to the existing ASVAB, we
followed a portion of this sample during the summer and fall of 1985 with a
broad array of criterion measures (described later). The follow-up of the
Preliminary Battery sample was part of a much larger Concurrent Validation
sample drawn from 1985 job incumbents who entered the Army during FY83/84.

Results from the administration of the preliminary predictor battery
(described later under predictor development) were used to help
develop the trial predictor battery for use in the major Concurrent
Validation during the summer and fall of 1985. Immediately prior to the
Concurrent Validation, all predictors and all criterion measures were put
through a series of field tests. For example, all criterion measures were
field tested on approximately 150 incumbents in each of nine MOS. The test
battery used during the predictor field tests was labeled the Pilot Trial
Battery. Both the Preliminary Battery sample and the field tests were used
to develop the Trial Battery for use in the Concurrent Validation.


The Trial Battery is being validated in a sample of 19 MOS against an
array of newly developed training and job performance measures. For each
MOS, 500-700 incumbents are being tested. As noted above, a subset of the
Concurrent  Validation  sample  took  the  Preliminary  Battery  approximately  IP 
months  earlier,  which  will  permit  a  longitudinal  validation  of  the  off-the- 
shelf  tests  that  were  selected  to  represent  major  ability  and  personality 
constructs. 

Figure 1.3. The overall data collection plan.


Analysis of the Trial Battery data will result in further revision of
the predictor battery. The revised version will then be called the
Experimental Battery, which will be used with a longitudinal validation sample
selected from people who enter the Army in FY86 and FY87. The Experimental
Battery will be administered at the time of entry to approximately 50,000
people distributed across 19 MOS. The training measures will be administered
at the conclusion of each individual's Advanced Individual Training (AIT)
course and the job performance criterion data will be collected approximately
18 months later. In addition, both general (Army-wide) and job (MOS)-specific
performance measures will be developed and administered to the
surviving members of both the FY83/84 and FY86/87 cohort samples during their
second tour of duty. Consequently, for both these samples the design is also
a longitudinal one.


Sample  Selection 

The overall objective in generating the samples has been to maximize the
validity and reliability of the information to be gathered, while at the same
time minimizing the time and costs involved. In part, costs are a function
of the numbers of people in the sample. However, costs are also influenced
by the relative difficulty involved in locating and assembling the people in
a particular sample, by the degree to which the unit's operations are
disrupted by the data collection, by the staff costs involved in collecting the
data in a particular manner, and by other such factors.

The sampling plan itself incorporated two principal considerations.
First, a sample of MOS was selected from the universe of possible MOS; then,
the required sample sizes of enlisted personnel (EP) within each MOS were
specified. The MOS are the primary sampling units. This design is necessary
because Project A is developing a system for a population of jobs (MOS), but
only a sample of MOS can be studied.

Large and representative samples of enlisted personnel within each
selected MOS are important because stable statistical results must be
obtained for each MOS. There is a trade-off in the allocation of project
resources between the number of MOS researched and the number of subjects
tested within each MOS: The more MOS are investigated, the fewer subjects
per MOS can be tested, and vice versa. Cost versus statistical reliability
considerations dictated that 19 MOS could be studied.
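
The trade-off described above can be made concrete with a short sketch. It assumes a fixed total number of testable subjects and uses the familiar large-sample approximation to the standard error of a correlation, (1 - r^2) / sqrt(n - 1). The total of 11,400 subjects and the validity of .30 are hypothetical values chosen only so that 19 MOS yields 600 subjects each, within the 500-700 range cited earlier; they are not Project A planning figures.

```python
import math

def se_of_r(r, n):
    """Large-sample approximation to the standard error of a correlation."""
    return (1 - r**2) / math.sqrt(n - 1)

TOTAL_SUBJECTS = 11_400   # hypothetical testing budget, not a Project A figure
ASSUMED_R = 0.30          # hypothetical true validity coefficient

def per_mos_precision(n_mos):
    """Subjects per MOS and the resulting standard error for one design."""
    n = TOTAL_SUBJECTS // n_mos
    return n, se_of_r(ASSUMED_R, n)

# Compare a narrow, a middle, and a broad sample of MOS.
table = {k: per_mos_precision(k) for k in (10, 19, 40)}
for k, (n, se) in table.items():
    print(f"{k:>2} MOS -> {n:>4} subjects each, SE(r) ~ {se:.3f}")
```

Studying more MOS widens coverage of the job population but inflates the sampling error of every within-MOS estimate, which is the balance the 19-MOS decision had to strike.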

To samples from all 19 MOS we have administered the new predictors (from
Task 2) and collected the school and Army-wide performance data (of Tasks 3
and 4). For nine of these MOS, we have also administered the MOS-specific
performance measures developed in Task 5. The nine MOS were chosen to
provide maximum coverage of the total array of knowledge, ability, and skill
requirements of Army jobs, given certain statistical constraints.


MOS  Selection 

The selection of the sample of 19 MOS proceeded through a series of
stages. The guidelines that follow were used to draw an initial sample of
MOS.


•  High-density MOS that would provide sufficient sample sizes for
statistically reliable estimates of new predictor validity and
differential validity across racial and gender groups.

•  Representative coverage of the aptitude areas measured by the ASVAB
area composites.

•  High-priority MOS (as rated by the Army² in the event of a
national emergency).

•  Representation of the Army's designated Career Management Fields
(CMF).

•  Representation of the jobs most crucial to the Army's mission (e.g.,
the combat specialties).

This set of 19 MOS represented 19 of the Army's 30 Career Management
Fields (CMF). Of the 11 CMF not represented, 2 (CMF 96 and 98) are
classified, 2 (CMF 33 and 74) had fewer than 500 FY81 accessions, and 7 (CMF 23,
28, 29, 79, 81, 84, and 74) had fewer than 300 FY81 accessions. The initial
set includes only 5% of Army jobs but 44% of the soldiers recruited in FY81.

Similarly, of the 15% women in the 1981 cohort, 44% are represented in
the sample; of the 27% blacks, 44% are represented in the sample; and of the
5% Hispanic, 43% are represented. Although female and minority representation
is high in absolute terms, in relative terms it remains about the same as in
the population.

Nine of the 19 MOS were earmarked for the job-specific performance
measurement phase of the project. These were selected, as a subset, with the
same general criteria used in identifying the parent list of 19. Since the
larger list is composed of 5 combat and 14 noncombat MOS, it seemed
reasonable that these categories be proportionally represented in the subset
of 9. Consequently, the 9 MOS designated for job-specific performance
measurement development are:


   (1)  11B    -  Infantryman
*  (2)  13B    -  Cannon Crewman
   (3)  19E/K  -  Tank Crewman
   (4)  05C    -  Radio TT Operator
   (5)  63B    -  Vehicle and Generator Mechanic
*  (6)  64C    -  Motor Transport Operator
*  (7)  71L    -  Administrative Specialist
   (8)  91B    -  Medical Care Specialist
*  (9)  95B    -  Military Police

An initial batch of four (designated on the list by asterisks) was selected
and termed Batch A; the other five are Batch B. Work was begun first on
Batch A and then on Batch B.


² ODCSOPS (DAMO-ODM), DF, 2 Jul 82, Subject: IRR Training Priorities.


Refinements of the MOS sample included a cluster analysis of expert
ratings of MOS similarity and a review of the initial sample by the
Governance Advisory Group.


MOS  Cluster  Analysis 

To obtain data for empirically clustering MOS on the basis of their task
content similarity, a brief job description was generated for each of 111 MOS
from the job activities described in AR 611-201.³ The sample of 111 MOS
represents 47% of the population of 238 Skill Level 1, Active Army MOS with
conventional ASVAB entrance requirements. It includes the 84 largest MOS
(300 or more new job incumbents yearly), plus an additional 27 selected
randomly but proportionately by CMF. Each job description was limited to two
sides of a 5x7 card.

Members of the contractor research staff and ARI Army officers
(approximately 25 in all) served as expert judges and were given the task of
sorting the sample of 111 job descriptions into homogeneous categories based
on perceived similarities and differences in job activities as described in
AR 611-201. Data from the similarity scaling task were used to cluster
analyze the matrix of similarities for the 111 jobs.
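
One common way to turn sorting data of this kind into clusters is sketched below: the similarity of two jobs is taken as the proportion of judges who placed them in the same pile, and the most similar clusters are merged step by step (average linkage). This is a minimal illustration under stated assumptions, not the project's procedure; the report does not name the clustering algorithm used. The four judges' sorts are invented, and the MOS codes appear only as convenient labels.

```python
from itertools import combinations

def similarity_matrix(sorts, items):
    """Proportion of judges who placed each pair of items in the same pile."""
    items = sorted(items)
    sim = {pair: 0.0 for pair in combinations(items, 2)}
    for piles in sorts:                      # one judge's sort = list of piles
        for pile in piles:
            for pair in combinations(sorted(pile), 2):
                sim[pair] += 1.0 / len(sorts)
    return sim

def average_linkage(sim, items, n_clusters):
    """Merge the two most similar clusters until n_clusters remain."""
    clusters = [frozenset([item]) for item in sorted(items)]

    def between(a, b):
        # Mean pairwise similarity between members of two clusters.
        pairs = [tuple(sorted((x, y))) for x in a for y in b]
        return sum(sim[p] for p in pairs) / len(pairs)

    while len(clusters) > n_clusters:
        a, b = max(combinations(clusters, 2), key=lambda ab: between(*ab))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters

# Hypothetical sorts from four judges (illustrative only).
jobs = ["11B", "13B", "05C", "71L"]
sorts = [
    [{"11B", "13B"}, {"05C", "71L"}],
    [{"11B", "13B"}, {"05C", "71L"}],
    [{"11B", "13B", "05C"}, {"71L"}],
    [{"11B", "13B"}, {"05C"}, {"71L"}],
]

sim = similarity_matrix(sorts, jobs)
clusters = average_linkage(sim, jobs, n_clusters=2)
print([sorted(c) for c in clusters])
```

A representativeness check of the kind described in the next paragraph then amounts to asking whether every resulting cluster contains at least one sampled MOS.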

The results were used to check the representativeness of the initial
sample of 19 MOS. That is, did the initial sample of MOS include
representatives from all the major clusters of MOS derived from the similarity
scaling? On the basis of these results, and guidance received from the
Governance Advisory Group, two MOS that had been selected initially (62E and
31M) were replaced by MOS 51B and MOS 27E, which are in the same CMF and
involve the same Aptitude Area Composites as the replaced MOS. The sample of
MOS resulting from the above procedures is shown in Table 1.1.

The next two sections of this report summarize, in somewhat more detail,
the project's activities for FY83 (year one) and FY84 (year two).


³ Army Regulation 611-201, Enlisted Career Management Fields and Military
Occupational Specialties.


Table 1.1. Project A Military Occupational Specialties (MOS)

'Weighted average of Traines Projections (3 months of FY83 and 9 months of FY84) adjusted for expected school attrition (actual FY81 rates).


Section  2 


OVERVIEW  OF  FISCAL  YEAR  1983 


During the first year of the project, detailed plans were prepared, the
sample of focal MOS was selected, the sample sizes required from each were
specified, and work was begun on the comprehensive predictor and criterion
development that would be the basis for the later validation. In addition,
the available computer file data on the FY81/82 cohort were merged from the
various sources, edited thoroughly, and prepared for analysis.

Plans for the project as a whole and activities during the first year
were described in the Annual Report for the 1983 fiscal year (ARI Research
Report 1347) and the technical appendix to that report (ARI Research Note
83-87), both published in October 1983.


Planning  Activities 

In general, as previously noted, much of the first year's effort was
taken up by an intensive period of planning, briefing the advisory groups,
preparing the initial troop requests, and related activities.

The requirement for a detailed research plan to be produced during the
first 6 months of the contract was included in the RFP. In hindsight, it
proved to be an even more valuable step than the authors of the RFP might
have had in mind. The research staff devoted a great deal of effort to the
writing of the research plan, and it was carefully reviewed by the advisory
groups and by the ARI professional staff. Revisions were then made, and the
completed plan was published in May 1983 under the joint authorship of the
contractor and ARI staffs.

The Research Plan and the accompanying Master Plan lay out, in detail,
the specific steps to be taken in each subtask in the project, the schedule
to be followed, and the budget allocations to be made to each subtask during
each contract period. These two documents have become the blueprint for the
project. They have also proven invaluable as a mechanism for developing a
consensus and facilitating communication among contractor staff and between
the contractor and ARI.

The detailed planning and review that went into the development of the
Research Plan and Master Plan made it possible to specify, clearly and
precisely, the troop support the project would need during its first 2
years. Consequently, the project staff has experienced relatively little
difficulty in communicating project needs to the appropriate Army
organizations and in gaining their support. The cooperation we have received has
been outstanding.


Criterion  Development 

During FY83 the development of performance measures proceeded through
the steps described below.


MOS Task Descriptions

Because the information had not been generated for personnel research
purposes, the Army's MOS job analysis data needed considerable modification
before they could be used by Project A for criterion development.
Consequently, a great deal of effort in FY83 was devoted to refining and
integrating task descriptions from Soldier's Manuals and from the Comprehensive
Data Analysis Program (CODAP) occupational survey questionnaires. For each MOS, a
data bank of task statements was accumulated from all available sources, and
the individual task statements were edited to determine if they indeed
focused on observable job tasks, if they were redundant or overlapped with
other tasks, and if they were at the same level of generality. Subject
matter experts (SMEs) were consulted to determine whether the edited pool of
task descriptions provided a complete picture of the MOS content. The SMEs
also judged the relative criticality of each task.

The  resulting  task  descriptions  provided  the  principal  basis  for  the 
development  of  hands-on  performance  measures  and  job  knowledge  tests. 


Assessment  of  Training  Performance 

A major objective of Project A is to use a comprehensive and standardized
test construction procedure to develop a measure of training success for
each focal MOS, in which the item content represents both the content of
training and the content of the job. That is, the items will sample the job
content representatively and will be further identified as being covered in
training vs. not being covered in training. When this is accomplished, a
measure of direct learning in training (scores on items that match training
content) and a measure of indirect learning (scores on items not directly
related to training content) can be related to a variety of job performance
criteria, with and without ability (as measured by predictor tests)
controlled.
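
The scoring logic described above can be sketched in a few lines: each item is flagged as covered or not covered in training, an examinee receives a direct and an indirect learning score, and the relation of either score to job performance can be examined with ability held constant. The sketch uses the standard first-order partial correlation formula. The function names, the item responses, and the correlation values are all hypothetical and are not drawn from Project A results.

```python
import math

def split_scores(responses, covered_in_training):
    """Return (direct, indirect) number-correct scores for one examinee.

    responses: list of 0/1 item scores.
    covered_in_training: parallel list of booleans, True where the item's
    content was taught in training.
    """
    direct = sum(r for r, c in zip(responses, covered_in_training) if c)
    indirect = sum(r for r, c in zip(responses, covered_in_training) if not c)
    return direct, indirect

def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y with z held constant (first-order partial)."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Hypothetical examinee: five items, the first three covered in training.
direct, indirect = split_scores([1, 0, 1, 1, 0],
                                [True, True, True, False, False])

# Hypothetical correlations: learning score with job performance (.40),
# learning score with ability (.50), job performance with ability (.30).
r_controlled = partial_r(0.40, 0.50, 0.30)
print(direct, indirect, round(r_controlled, 3))
```

Comparing the zero-order correlation with the partial correlation shows how much of the training-performance relationship is attributable to ability alone.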

On the way to developing norm-referenced training achievement tests for
each of the 19 MOS, the staff visited each Proponent school and developed a
description of the objectives and content of the training curriculum. They
also used Army Occupational Survey Program (AOSP) information to develop a
detailed task description of job content for each MOS. After low-frequency
elements were eliminated, SME judgments were used to rate the importance and
error frequency for each task element. Approximately 225 tasks were then
sampled proportionately from MOS duty areas.
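
Proportionate sampling of this kind can be illustrated with a largest-remainder allocation, which divides a fixed number of tasks among duty areas in proportion to their size and guarantees that the counts sum to the target. The duty-area names and task counts below are invented for illustration, and the allocation rule is an assumption; the report states only that sampling was proportionate.

```python
def proportional_allocation(total, sizes):
    """Split `total` across strata in proportion to `sizes` (largest remainder)."""
    population = sum(sizes.values())
    # Integer arithmetic avoids floating-point rounding surprises.
    quota = {k: total * v // population for k, v in sizes.items()}
    remainder = {k: total * v % population for k, v in sizes.items()}
    leftover = total - sum(quota.values())
    # Hand the remaining units to the strata with the largest remainders.
    for k in sorted(remainder, key=remainder.get, reverse=True)[:leftover]:
        quota[k] += 1
    return quota

# Hypothetical duty areas and task counts for one MOS (illustrative only).
duty_areas = {
    "weapons": 120,
    "communications": 80,
    "first aid": 50,
    "maintenance": 150,
}
sample = proportional_allocation(225, duty_areas)
print(sample)
```

Tasks would then be drawn at random within each duty area up to its allocated count.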

What was produced was a thorough analysis of the objectives, curriculum,
and assessment procedures for the key schools. The process of describing MOS
job content and matching it with training content was begun in FY83 and
completed during FY84.


Assessment  of  Job  Performance 

The initial model of soldier effectiveness that we developed was
perhaps a bit crude. We said, essentially, that both specific task performance
and the general factors of commitment, morale, and organizational
socialization made up the total domain.

During FY83 the task descriptions for the four MOS in Batch A were
completed and those for Batch B were in progress. Virtually all the critical
incident workshops necessary for constructing MOS-specific task performance
factors were completed. This has most likely been the largest effort
ever undertaken to apply Behaviorally Anchored Rating Scales (BARS) methods
to criterion development. There now exist accounts of hundreds of critical
incidents of specific task performance within each focal MOS, and thousands
of critical incidents describing performance behaviors that have a general,
not MOS-specific, referent. These large samples of job behaviors were used
to identify MOS-specific and MOS-general performance factors and (during
FY84) to develop rating scales to assess individual performance on these
factors. This process produced a revised and expanded model of the criterion
space to be used to generate further criterion development work.

An additional important outcome of the interaction between developing
the model and describing tasks/behavior was the identification of an array of
MOS-specific task performance factors intended to encompass the unique task
content of all MOS in the enlisted personnel job structure. Although it was
only a first cut, it provides the basis for the further development of a
standardized set of task descriptors that can be applied to any MOS to
describe its content. Such a standardized measure will make it possible to
answer a number of important questions that could not have been addressed
previously. For example, how similar are any two MOS in terms of their job
content? Should they have a common selection algorithm? How different should
their training schools be?


Predictor  Selection 

A major objective that had to be accomplished during the first contract
year was to select the preliminary predictor battery for administration to
the FY83/84 longitudinal sample and to lay the groundwork for the development
of the trial predictor battery. To do this, the project staff carried out a
massive literature search. The result was (a) a description of the specific
measures that might be useful in any selection or classification effort, (b)
a summary of the empirical evidence attendant to each one, and (c) an
explication of the latent variables, or constructs, that seem to best represent
the content of the operational measures or tests.


Data Base Management/Validation Analysis

Project A will generate a large amount of interrelated data that must be
assembled into an integrated data base that can be accessed easily by the
research teams for analytical purposes. Therefore, a major task was to
establish and maintain the longitudinal research data base (LRDB), which
links data on diverse measures gathered in the various tasks of Project A and
incorporates existing data routinely collected by the Army. Such a
comprehensive LRDB will enable Project A to conduct a full analysis of how
information gathered at each stage of the enlistee's progress through his or her
Army career can add to the accuracy of predicting later performance.

In accordance with the Project A Research Plan, the LRDB will contain
three major sets of data. The first set consists of existing data on FY81/82
accessions, including accession information (demographic/biographical data,
test scores, and enlistment options), training success measures, measures of
progress or attrition taken from the Enlisted Master File (EMF), and specific
information on SQT scores. This first set of data is to be employed to
validate the current version of the Armed Services Vocational Aptitude Battery
(ASVAB), insofar as that can be done with available criteria. It will also be
used to investigate major methodological and conceptual issues. The second
and third major sets of data will involve the new data collection efforts of
the FY83/84 and FY86/87 cohorts.

A significant portion of the first year's LRDB activities involved
planning the data base contents and procedures for the duration of the project.
The main result of this activity was the draft and final LRDB plan. Other
planning accomplishments included installing the RAPID data storage and
retrieval system, developing workfile generation and data set documentation
programs, identifying and implementing data file integrity and security
procedures, and establishing data editing procedures.

Most of the substantive LRDB results during the first year were related
to the creation of the FY81/82 cohort data base for use in the preliminary
validation of the current ASVAB and the evaluation of new aptitude area
composites. The validity and differential validity of the existing
predictors (ASVAB 8/9/10) against existing criteria (training grades, SQT, and
administrative outcomes) were being determined for all MOS for which there were
sufficient data. These results will serve as a benchmark against which the
subsequent validations using new and/or improved predictors and criterion
measures can be compared. The validity of alternative composites of ASVAB
subtests can also be compared with the validity of the existing composites.

In  Conclusion 

During its first contract year Project A stayed on schedule and within
its budget. More attention than the Army's research staffs had originally
envisioned was devoted to detailed planning and outside review. However,
these thorough and careful preparatory steps seemed well worthwhile in terms
of facilitating communication among all persons associated with the project
and uncovering unresolved issues that would have plagued us at some later
time.

Also, although much of the research activity during the first year was
designed as essentially preparatory, some valuable first-year products
emerged, including the FY81/82 data file, the task banks, the critical
incident banks, and the literature review of the predictor domain.
Section 3


OVERVIEW  OF  FISCAL  YEAR  1984 


During the second year of work on Project A, the major efforts were in
the development of performance measures and predictor tests, evaluation of
the validity of the ASVAB, and exploratory investigation of procedures for
scaling the utility of performance levels across MOS.

The work performed during the second year is described in detail in the
Annual Report Synopsis for FY84 (ARI Research Report 1393) and a companion
detailed report (ARI Technical Report 660), which also includes technical
documents that were prepared during the year to support various aspects of
the program (and which is supplemented by ARI Research Note 85-14, containing
additional appendix material). All three reports were published in October
1984.


Project  Administration 

The overall administration and structure of Project A continued without
change in FY84. However, a contract amendment dealing with the scope of work
was designed and implemented as envisaged in the Research Plan (ARI Research
Report 1332). The amendment provides for a shift in focus to future cohorts
(from the FY81/82 and FY84/85 cohorts to the FY83/84 and FY86/87 cohorts).

It also specifies the additional work entailed in:

•  Acquiring training school data on the FY83/84 cohort for predictor
and criterion development.

•  Conducting validity analyses of the FY81/82 cohort data.

•  Conducting additional job and task analyses to support refinements
in the MOS sample.

•  Preparing detailed analyses to support the sampling strategy (and
the resultant Troop Support Requests).

•  Developing and administering the "Preliminary Battery."

•  Acquiring, using, and maintaining computerized psychomotor/
perceptual test equipment.

•  Expanding the utility research program.

•  Extending the research schedule through 1991 to retain the
objective of analyzing second-term validity data on the second (FY86/87)
main cohort.

Included in the changes noted above was a requirement for an extensive
investigation of psychomotor/perceptual measures. Implementing this decision
required the acquisition, use, and maintenance of computer-driven test
equipment.
During the course of the second year there were several personnel
changes in the Governance Advisory Group. These changes are reflected in
Figure 1.4. There were also changes in assignments for the ARI task monitors
and consortium task leaders and other key personnel. The assignments for
these monitor/leader positions at the end of FY84 are reflected in Figure 1.5.


[Chart: Governance Advisory Group. Entries legible in the scan: MG B.B. Porter, Dr. R. Linn, Dr. M. Tenopyr, Dr. J. Uhlaner, BG W.C. Knudson, COL I.P. Amor.]

Figure 1.4. Governance Advisory Group as of 30 September 1984.


School  and  Job  Performance  Measurement 


Project  A  criterion  development  was  at  the  following  point  at  the 
beginning  of  the  project's  second  year  in  October  1983: 

•  The critical incident procedure had been used with two workshops of
officers to develop a first set of 22 dimensions of Army-wide
rating scales, as well as an overall performance scale and a scale
for rating the potential of an individual to be an effective NCO.

•  The critical incident procedure had also been used to develop
dimensions of technical performance for each of the four MOS in
Batch A (13B, Cannon Crewman; 64C, Motor Transport Operator; 71L,
Administrative Specialist; 95B, Military Police).
Figure 1.5. Project A organization as of 30 September 1984.


•  The pool of 30 tasks in each Batch A MOS that would be subjected to
hands-on and/or knowledge test measurement had been selected.
After preparing job task descriptions, the staff had used a series
of judgments by subject matter experts, considering task
importance, task difficulty, and intertask similarity, to select the
final sets of tasks.

•  In working toward norm-referenced training achievement tests, at
the end of FY83 we had a refined task sample for each MOS and
systematic descriptions of the training program against which to
develop a test item budget.

•  A preliminary analysis had been made of the feasibility of
obtaining archival performance records from the computerized
Enlisted Master File (EMF), the Official Military Personnel File
(OMPF), which is centrally stored on microfiche, and the Military
Personnel Records Jacket (201 File).

The principal objectives for criterion development for FY84 were to
(a) use the information developed in FY83 to construct the initial version of
each criterion measure, (b) pilot test each initial version and modify as
appropriate, and (c) evaluate the criterion measures for the four MOS in
Batch A in a relatively large-scale field test (about 150 enlisted personnel
in each MOS). The field test continued into FY85, during which the criterion
measures for the five MOS in Batch B were evaluated.

During FY84 a pilot version was developed for most, but not all,
criterion measures. The specific progress on each measure is described
below.

Army-Wide Rating Scales. An additional four critical incident workshops
involving 77 officers and NCOs were conducted during FY84. On the basis of
the critical incidents collected in all workshops, a preliminary set of 15
Army-wide performance dimensions was identified and defined. Using a
combination of workshop and mail survey participants (N = 61), the initial
set of dimensions was retranslated and 11 Army-wide performance factors
survived. The scaled critical incidents were used to define anchors for each
scale, and directions and training materials for raters were developed and
pretested.

During the same period scales were developed to rate overall performance
and individual potential for success as an NCO. Finally, rating scales were
constructed for each of 14 common tasks that were identified as part of the
responsibility of each individual in every MOS.

MOS-Specific BARS Scales. Four critical incident workshops involving
70-75 officers and NCOs were completed for each of the MOS in Batch A and
Batch B. A retranslation step similar to that for the Army-wide rating
scales was carried out, and six to nine MOS-specific performance rating
scales (Behaviorally Anchored Rating Scales, BARS) were developed for each
MOS. Directions and training materials for the scales were also developed and
pretested.
Hands-on Measures (Batch A). After the 30 tasks per MOS were selected
for Batch A, the two major development tasks that remained before actual
preparation of tests were the review of the task lists by the Proponent
schools and the assignment of tasks to testing mode (i.e., hands-on job
samples vs. knowledge testing).

For assignment of tasks to testing mode, each task was rated by three to
five project staff on three dimensions. The extent to which a task was
judged to require (a) a high level of physical skill, (b) a series of
prescribed steps, and (c) speed of performance determined whether it was
assigned to the hands-on mode. For each MOS, 15 tasks were designated for
hands-on measurement. Job knowledge test items were developed for all 30
tasks.

The pool of initial work samples for the hands-on measures was then
generated from training manuals, field manuals, interviews with officers and
job incumbents, and any other appropriate source. Each task "test" was
composed of a number of steps (e.g., in performing cardiopulmonary
resuscitation), each of which was to be scored "go, no-go" by an incumbent NCO. A
complete set of directions and training materials for scorers was also
developed. The initial hands-on measures and scorer directions were then
pretested on 5 to 10 job incumbents in each MOS and revised.

MOS-Specific Job Knowledge Tests (Batch A). A paper-and-pencil,
multiple-choice job knowledge test was developed to cover all of the 30 tasks
in the MOS lists. The item content was generated on the basis of training
materials, job analysis information, and interviews, with an average of about
nine items prepared for each of the 30 tasks. For the 15 tasks also measured
hands-on, the knowledge items were intended to be as parallel as possible to
the steps that comprised the hands-on mode. The knowledge tests were pilot
tested on approximately 10 job incumbents per MOS.

Task Selection and Test Construction for Batch B. By the end of FY84,
basic task descriptions had been developed for Batch B in a manner similar to
that used for Batch A. However, task descriptions had not yet been submitted
to SME judgments about difficulty, importance, and similarity. The remaining
steps of task selection, Proponent review, assignment to testing mode, and
test construction were carried out in FY85.

In addition, for Batch B a formal experimental procedure was used to
determine the effects of scenario differences on SME judgment of task
importance. The design called for 30 SMEs to be randomly assigned to one of
three scenarios (garrison duty/peacetime, full readiness for a European
conflict, and an outbreak of hostilities in Europe).
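
The random assignment described above can be sketched as follows. This is a minimal illustration, not the project's procedure: the equal group sizes (10 SMEs per scenario), the SME identifiers, and the function name are assumptions.

```python
import random

SCENARIOS = (
    "garrison duty/peacetime",
    "full readiness for a European conflict",
    "outbreak of hostilities in Europe",
)

def assign_scenarios(sme_ids, scenarios=SCENARIOS, seed=None):
    """Shuffle the SMEs, then deal them out to scenarios in rotation.

    Dealing in rotation keeps group sizes equal (or within one of each
    other), which is an assumption here; the report gives only the totals.
    """
    rng = random.Random(seed)
    shuffled = list(sme_ids)
    rng.shuffle(shuffled)
    return {sme: scenarios[i % len(scenarios)] for i, sme in enumerate(shuffled)}

if __name__ == "__main__":
    assignment = assign_scenarios(range(1, 31), seed=1984)
    for scenario in SCENARIOS:
        members = sorted(s for s, sc in assignment.items() if sc == scenario)
        print(scenario, members)
```

Fixing the seed makes the assignment reproducible, which is useful when the allocation has to be documented.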

Training Achievement Tests (Batch A). During FY84, generation of
refined task lists for each of the 19 MOS in the Project A sample continued.
For each MOS in Batch A, an item budget was prepared matching job duty areas
to course content modules and specifying the number of items that should be
written for each combination. An item pool that reflected the item budget
was then written by a team of SMEs contracted for that purpose.
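
An item budget of this kind can be represented as a table of duty area by course module cells, each with a target item count. The sketch below allocates a fixed number of items in proportion to cell weights using largest-remainder rounding. The proportional rule, the weights, and the cell labels are all assumptions for illustration; the report does not describe how the counts were set.

```python
def item_budget(weights, total_items):
    """Allocate total_items across cells in proportion to their weights.

    weights maps (duty_area, module) -> nonnegative weight. Each cell gets
    the floor of its exact share, and leftover items go to the cells with
    the largest fractional remainders, so the counts sum to total_items.
    """
    total_weight = sum(weights.values())
    exact = {cell: total_items * w / total_weight for cell, w in weights.items()}
    budget = {cell: int(share) for cell, share in exact.items()}
    leftover = total_items - sum(budget.values())
    by_remainder = sorted(exact, key=lambda c: exact[c] - budget[c], reverse=True)
    for cell in by_remainder[:leftover]:
        budget[cell] += 1
    return budget

if __name__ == "__main__":
    # Hypothetical duty areas, modules, and weights.
    weights = {
        ("maintenance", "module 1"): 3,
        ("maintenance", "module 2"): 1,
        ("operation", "module 1"): 2,
    }
    print(item_budget(weights, total_items=10))
```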
Training content SMEs and job content SMEs then judged each item in
terms of its importance for the job (under each of the three scenarios, in a
repeated measures design), its relevance for training, and its difficulty.
The items were then "retranslated" back into their respective duty areas by
the job SMEs and into their respective training modules by the training
SMEs. Items were designated as "job only" if they reflected task elements
that were described as an important part of the job but had no match with
training content; such items are intended to be a measure of incidental
learning in training.

Administrative (Archival) Indexes. A major effort in FY84 was a
systematic comparison of information found in the Enlisted Master File (EMF),
the Official Military Personnel File (OMPF), and the Military Personnel
Records Jacket (201 File). A sample of 750 incumbents, stratified by MOS and
by location, was selected and the files searched. For the 201 Files the
research team made on-site visits and used a previously developed protocol to
record the relevant information. A total of 14 items of information,
including awards, letters of commendation, and disciplinary actions, seemed,
on the basis of their base rates and judged relevance, to have at least some
potential for service as criterion measures.

Unfortunately, the microfiche records appeared too incomplete to be
useful, and search of the 201 Files was cumbersome and expensive. It was
decided to try out a self-report measure for the administrative indexes and
compare it to actual 201 File information for the people in the field trials
during FY85.


Predictor  Measurement 

During FY83, predictor development activities had been focused on
comprehensive reviews of the literature (for cognitive, non-cognitive, and
psychomotor measures, respectively), visits to other personnel research
laboratories, and consultations with designated experts in the field. The
available literature was systematically catalogued on record forms designed
for the project and the process of summarizing the information was begun.

The  major  activities  completed  during  FY84  were: 

•  The definition and identification of the most promising predictor
constructs.

•  The administration and initial analysis of the Preliminary Battery.

•  The development, tryout, and pilot testing of the first version of
the Trial Battery, called the Pilot Trial Battery.

•  The development and tryout of psychomotor/perceptual measures,
using a microprocessor-driven testing device.

Each of these activities is briefly summarized below. A more complete
description of new test development is presented in later sections of this
report.


Construct  Definition 


The first activity, defining and identifying the most promising
predictor constructs, was accomplished in large part by using experts to provide
structured, quantified estimates of the empirical relationships of a large
number of predictors to a set of Army job performance dimensions (the
dimensions were defined by other Project A researchers). By pooling the judgments
of 35 experienced personnel psychologists, we were able to more reliably
identify the "best" measures to carry forward in Project A.
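
The gain in reliability from pooling judges is commonly estimated with the Spearman-Brown formula, r_k = k r / (1 + (k - 1) r), where r is the reliability of a single judge and k is the number of judges. The sketch below applies it; the single-judge reliability of 0.20 is a hypothetical value chosen for illustration, since no such figure is given here.

```python
def spearman_brown(single_judge_reliability, n_judges):
    """Estimated reliability of the mean of n_judges parallel judgments."""
    r, k = single_judge_reliability, n_judges
    return k * r / (1 + (k - 1) * r)

if __name__ == "__main__":
    # Hypothetical single-judge reliability; 35 is the number of experts pooled.
    for k in (1, 5, 35):
        print(k, round(spearman_brown(0.20, k), 3))
```

Even a modest single-judge reliability rises steeply as judgments are averaged, which is the rationale for pooling many experts rather than relying on a few.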

These estimates were combined with other information from the literature
review and Preliminary Battery analyses, and a final, prioritized list of
constructs was identified.

This effort also produced a heuristic model based on factor analyses of
the experts' judgments. This model organizes the predictor constructs and job
performance dimensions into broader, more generalized classes and shows the
estimated relationships between the two sets of classes. This analysis is
fully described in Wing, Peterson, and Hoffman (1984).
Preliminary Battery


Similarly, the initial analyses of Preliminary Battery data provided
empirical results to guide development of Pilot Trial tests. Data were
collected with the Preliminary Battery on four MOS: 05C (Fort Gordon), 19E/K
(Fort Knox), 63B (Fort Dix and Fort Leonard Wood), and 71L (Fort Jackson).


The first 1,800 cases from a total sample of over 11,000 were used in
the initial analyses. These analyses enabled us to tailor the Pilot Trial
Battery tests more closely to the enlisted soldier population. They also
demonstrated the relative independence of cognitive ability tests and
non-cognitive inventories of temperament, interest, and biographical data. This
effort is fully reported in Hough, Dunnette, Wing, Houston, and Peterson
(1984).


Pilot Trial Battery


The information from the first two activities fed into the third
activity: the development, tryout, revision, and pilot testing of new
predictor measures, collectively labeled the Pilot Trial Battery. New measures
were developed to tap the ability constructs that had been identified and
prioritized. These measures were tried out on three separate samples, with
improvements being made between tryouts. The tryouts were conducted at Forts
Carson, Campbell, and Lewis with approximately 225 soldiers participating.
At the end of the second year, the final version of the Pilot Trial
Battery underwent a pilot test on a larger scale. Data were collected to
allow investigation of various properties of the battery, including
distribution characteristics, covariation with ASVAB tests, internal consistency and
test-retest reliability, and susceptibility to faking and practice effects.
About 650 soldiers participated in the pilot test.


Computer-Administered  Measures 


The development, tryout, revision, and pilot testing of computerized
measures is actually a subset of the Pilot Trial Battery development effort,
but is worthy of separate mention. Several objectives were reached during
1984. An appropriate microprocessor was identified and six copies were
obtained for development use. The ability constructs to be measured were
identified and prioritized. Software was written to utilize the
microprocessor for measuring the abilities and to administer the new tests with a
minimum of assistance from human administrators. A customized response
pedestal was designed and fabricated so that responses would be reliably and
straightforwardly obtained from the people being tested. The software and
hardware were put through an iterative tryout and revision process.


Data Base Management/Validation Analyses

During Project A's second year, the Longitudinal Research Data Base
(LRDB) was expanded dramatically. The first major validation research effort
was carried out, using information on existing predictors and criteria in the
expanded LRDB. The initial validation research led to a proposal for
improving the Army's existing procedures for selecting and classifying new
recruits; the proposed improvements were adopted by the Army after thorough
review and were implemented in the ASVAB at the beginning of FY85. A number
of smaller research efforts were also supported with the expanded LRDB.

Growth  of  the  LRDB 

FY84 saw three major LRDB expansion activities:

•  The enlargement of the FY81/82 cohort data files.

•  The establishment of the FY83/84 cohort data files.

•  The addition and processing of pilot and field test data files for
different predictor and criterion instruments.

Expansion of the FY81/82 Cohort Data Files. During FY83, we had
accumulated application/accession information on Army enlisted recruits who
were processed in FY81 or FY82, and we had processed data from AIT courses on
their success in training. During FY84, we added SQT data providing
information on the first-tour performance of these soldiers subsequent to their
training. SQT information was found for a total of 63,706 soldiers in this
accession cohort, notwithstanding the fact that many of the soldiers in this
cohort were not yet far enough along to be tested in this time period and
others were in MOS that were not tested at all during this period.

In addition to SQT information, administrative information from the
Army's Enlisted Master File was added to the FY81/82 data base. Key among
the variables culled from the EMF were those describing attrition from the
Army, including the cause recorded for each attrition, and those describing
the rate of progress of the remaining soldiers. Records were found for a
total of 196,287 soldiers in this cohort. While the major source of
administrative information was the FY83 year-end EMF files, information on progress
and attrition was added from March and June 1984 quarterly EMF files.

Establishment of the FY83/84 Cohort Data Files. During FY84,
application and accession information was assembled on recruits processed during
FY83 and FY84. This cohort is of particular importance to Project A because
it is the cohort to be tested in the Concurrent Validation effort. In
addition to accession information, administrative data on the progress of this
cohort were extracted from annual and quarterly EMF files.

With the FY83/84 cohort, we began to include data collected on new
instruments developed by Project A. Preliminary Battery information was
collected on more than 11,000 soldiers in four different MOS.

During FY84 we also accumulated archival data on training grades for
soldiers in the four MOS to which the Preliminary Battery (PB) was
administered. At the end of FY84, data were still being added on soldiers who had
taken the Preliminary Battery at the beginning of their training. The data
collected included both written and hands-on performance measures
administered at the end of individual modules as well as more comprehensive
end-of-course measures. Table 1.2 shows the number of soldiers for whom PB
information is available, the number for whom training performance
information is available, and the number for whom both types of
information are available.


Table 1.2

FY83/84 Soldiers With Preliminary Battery and Training Data

                                           Cases With Both PB and Training Data
                                           ------------------------------------
            Total       Total                       Percent of    Percent of
MOS         PB Cases    Training Cases*    Total    PB Total      Training Total

05C/31C      2,411       1,951               833       35            43
19E/K        2,617       2,749             1,809       69            66
63B          3,245       1,959             1,223       38            62
71L          3,039       4,654             2,079       68            45

Total       11,312      11,313             5,944

* As of FY84 year-end.
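
The percentage columns of Table 1.2 follow directly from the counts: each is the number of cases with both kinds of data divided by the relevant total. The short check below recomputes them, reading the fourth MOS as 71L; the function and variable names are illustrative.

```python
# Counts from Table 1.2: (PB cases, training cases, cases with both).
TABLE_1_2 = {
    "05C/31C": (2411, 1951, 833),
    "19E/K": (2617, 2749, 1809),
    "63B": (3245, 1959, 1223),
    "71L": (3039, 4654, 2079),
}

def overlap_percents(pb_total, training_total, both):
    """Percent of PB cases and percent of training cases that have both."""
    return round(100 * both / pb_total), round(100 * both / training_total)

def column_totals(table):
    """Sum each of the three count columns across MOS."""
    return tuple(sum(row[i] for row in table.values()) for i in range(3))

if __name__ == "__main__":
    for mos, counts in TABLE_1_2.items():
        print(mos, overlap_percents(*counts))
    print("Total", column_totals(TABLE_1_2))
```

The recomputed percentages and column totals agree with the tabled values.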


Creation of Pilot and Field Test Data Files. During FY84, a great deal
of information was collected in conjunction with the development of new
instruments to be used in the Concurrent Validation. The largest
accumulation of such information resulted from the Batch A combined criterion
field test. The combined information led to more than 3,000 analysis
variables for each of the 548 soldiers tested.


A second major field test effort during FY84 involved the Pilot Trial
Battery. Scheduling conflicts postponed the data collection effort until
very late in the fiscal year, so initial processing of these data had only
begun by the end of FY84.

In addition to the major field tests of predictor and criterion
instruments, data from a number of other efforts were incorporated into the LRDB.
These included ratings of task and item importance, pilot tests on trainees
of the comprehensive job knowledge tests intended for training use, and data
gathered during the exploratory round of utility workshops.


ASVAB  Area  Composite  Validation 

As a first step in its continuing research effort to improve the Army's
selection and classification system, Project A completed a large-scale
investigation of the validity of Aptitude Area Composite tests currently used by
the Army as standards for the selection and classification of enlisted
personnel. This research had three major purposes: to use available data to
determine the validity of the current operational composite system, to
determine whether a four-composite system would work as well as the current
nine-composite system, and to identify any potential improvements for the
current system.

The ASVAB is composed of 10 cognitive tests or subtests, and these
subtests are combined in various ways by each of the services to form
Aptitude Area (AA) Composites. It is these AA composites that are used to
predict an individual's expected performance in the service. The U.S. Army
uses a system of nine AA composites to select and classify potential enlisted
personnel: Clerical/Administration (CL), Combat (CO), Electronics Repair
(EL), Field Artillery (FA), General Maintenance (GM), Mechanical Maintenance
(MM), Operators/Food (OF), Surveillance/Communications (SC), and Skilled
Technical (ST).

The criterion measures used in the Project A analyses as indexes of
soldier performance were end-of-course training grades and SQT scores. While
both have some limitations, they were the best available measures of soldier
performance. These two criteria were first standardized within MOS, and then
combined to form a single index of a soldier's performance in his or her MOS.
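
The two-step criterion construction described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the report does not say how the two standardized criteria were combined, so a simple unweighted average is assumed here, and the MOS labels, field names, and scores are made up for the example.

```python
from collections import defaultdict
from statistics import mean, pstdev


def standardize_within_mos(records, key):
    """Return z-scores for records[i][key], computed separately within each MOS."""
    values_by_mos = defaultdict(list)
    for rec in records:
        values_by_mos[rec["mos"]].append(rec[key])
    stats = {mos: (mean(v), pstdev(v)) for mos, v in values_by_mos.items()}
    z_scores = []
    for rec in records:
        m, s = stats[rec["mos"]]
        # Guard against a group with no score variance.
        z_scores.append((rec[key] - m) / s if s > 0 else 0.0)
    return z_scores


def performance_index(records):
    """Average the two within-MOS z-scores (ASSUMED combination rule)."""
    z_train = standardize_within_mos(records, "training_grade")
    z_sqt = standardize_within_mos(records, "sqt")
    return [(a + b) / 2 for a, b in zip(z_train, z_sqt)]


# Illustrative data only; not taken from the report.
soldiers = [
    {"mos": "11B", "training_grade": 80, "sqt": 60},
    {"mos": "11B", "training_grade": 90, "sqt": 70},
    {"mos": "71L", "training_grade": 70, "sqt": 85},
    {"mos": "71L", "training_grade": 74, "sqt": 95},
]
index = performance_index(soldiers)
```

Standardizing within MOS first removes differences in grading scale and test difficulty between specialties, so the combined index expresses each soldier's standing relative to others in the same MOS.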

One unique aspect of the composite phase of the research was the large
size of the samples used in the analyses. The total sample size of nearly
65,000 soldiers renders this research one of the largest (if not the largest)
validity investigations conducted to date.

The validities obtained in this research for the current nine AA
composites are given in Figure 1.6. As can be seen, the existing composites
are very good predictors of soldier performance. The composite validities
ranged from a low of .44 to a high of .58, with the average validity being
about .48. These numbers for the existing predictors are about as high as one
is likely to find in measuring test validities.


[Figure 1.6: chart of validity by Aptitude Area cluster (CL, CO, EL, FA, GM,
MM, OF, SC, ST), comparing the nine-composite and four-composite systems.]

Figure 1.6. Predictive validities for nine and four Aptitude Area composite systems.


A second finding was that, despite the high validities of the existing
composites, a set of four newly defined AA composites could be used to
replace the current nine without a decrease in composite validity. This set
of four alternative composites included: a new composite for the CL cluster
of MOS; a single new composite for the CO, EL, FA, and GM MOS clusters; a
single new composite for the GM, MM, OF, and SC MOS clusters; and a new
composite for the ST cluster of MOS.

Figure 1.6 also shows the test validities (corrected for range
restriction) for this four-composite system when it is used to predict
performance in the nine clusters of MOS defined by the current system. In all
cases the four-composite solution showed test validities equal to or greater
than the existing nine-composite case.
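
For readers unfamiliar with range-restriction corrections, the sketch below shows one widely used univariate formula (often called Thorndike's Case 2). The report does not state which correction Project A applied, so this is an assumption chosen for illustration, and the function and parameter names are hypothetical.

```python
import math


def correct_range_restriction(r_restricted, sd_unrestricted, sd_restricted):
    """Estimate the validity in the unrestricted (applicant) population.

    Univariate correction for direct selection on the predictor
    (ASSUMED; not necessarily the method used in the report).
    """
    u = sd_unrestricted / sd_restricted
    r = r_restricted
    return (r * u) / math.sqrt(1 - r ** 2 + (r ** 2) * (u ** 2))


# Illustrative values only: an observed validity of .30 in a selected sample
# whose predictor standard deviation is half that of the applicant population.
corrected = correct_range_restriction(0.30, sd_unrestricted=10.0, sd_restricted=5.0)
```

The correction matters because validities are computed on soldiers who were already selected on ASVAB scores; the reduced score variance in that group lowers the observed correlation relative to the applicant population.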

A corollary finding of the investigation into the four-composite
solution was that the validities for two of the nine composites could be
substantially improved without making major changes to the entire system.
This improvement was accomplished by dropping two speeded subtests (Numerical
Operations and Coding Speed) from the CL and SC composites and replacing them
with the Arithmetic Reasoning and Mathematical Knowledge subtests for the CL
composite and the Arithmetic Reasoning and Mechanical Comprehension subtests
for the SC composite. Figure 1.7 compares the old and new forms for the CL
and SC composites. This simple substitution of different subtests was able
to  Improve  the  predictive  validity  of  the  CL  composite  by  IPX  and  of  the  SC 
composite  by  11%, 


Current  ASVAB 
Composite 

Proposed 

Composite 

Subtests 

r, 

Subtests 
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MOS 
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.48 

VE+AR+MK 

.56 
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MOS 
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.45 
Figure 1.7. A comparison of current and alternative Aptitude
Area composites.


On the basis of these data, the Army decided to implement the proposed
alternative composites for CL and SC, effective 1 October 1984.

A fuller discussion of the research entailed in the development and
validation of the AA composites can be found in McLaughlin et al. (1984).

In  Conclusion 

By the end of FY84 the initial development and first pilot testing of
all major predictor and criterion variables had been completed. This
included the development of the first versions of the hands-on job samples
and the computer-administered perceptual and psychomotor tests.

In  addition,  the  formal  field  tests  of  the  criterion  measures  were  begun 
on  the  Batch  A  MOS.  The  predictor  battery  designated  as  the  Pilot  Trial 
Battery  also  underwent  a  more  comprehensive  pilot  testing  on  approximately 
650  soldiers. 

Finally, by the end of FY84 the longitudinal research data base had been
designed and the revalidation of the ASVAB using FY81/82 file data had been
completed.
Section 4


FISCAL  YEAR  1985 

The third year of Project A was both resource and effort intensive. It
was during FY85 that all predictor and criterion construction was completed,
all field tests were completed, the field test results were analyzed, the
final revisions (before validation) of the measures were made, and the
Concurrent Validation data collection was begun.


Project  Administration 

During the third year's work, several changes were made in the
Governance Advisory Group. These changes are reflected in Figure 1.8. There
were also changes among the ARI task monitors and the consortium task leaders
and other key personnel. The assignments for these positions at the end of
FY85 are shown in Figure 1.9.


Research  Activities 
A summary schedule of FY85 events is as follows:

1.  Completion of field tests for pilot         March 1985
    trial predictor battery.

2.  Analysis of predictor field test            October 1984-April 1985
    data.

3.  Revision of Pilot Trial Battery to          May-June 1985
    form the Trial Battery.

4.  Completion of Batch B criterion             March 1985
    field tests.

5.  Analysis of Batch A and Batch B             December 1984-June 1985
    field test data.

6.  Revision of criterion measures for          May-June 1985
    use in Concurrent Validation.

7.  Army proponent review of instruments        May 1985
    used in Concurrent Validation.

8.  Start of Concurrent Validation.             June 1985

Since the project's third year was such a crucial one and represented a
culmination of a great deal of basic development work, it seems appropriate
to use this FY85 report to summarize in some detail the first 3 years of
Project A. Consequently, the major parts of Project A will be described,
with the discussion emphasizing, but not limited to, the activities during
FY85. This comprehensive presentation includes a discussion of how each
phase of the research was conceptualized and initially formulated, as well as
a description of what was actually done. Bear with us as we go back to year
one and pick up the story after the organization and staffing of the project
had been agreed upon, the research plan had been completed and approved, and
work had begun in earnest on each substantive task.


Figure 1.8. Project A Management Group as of 30 September 1985.

Figure 1.9. Project A organization as of 30 September 1985.


Organization  of  the  Report 




The remainder of this report is divided into two major sections: Part II
describes the development and field test of the predictor battery, and Part
III deals with the development and field test of the performance measures.
By 30 May 1985 the final array of predictors and criteria to be used in the
Concurrent Validation was agreed upon, and the validation collection began in
June. The basic procedures employed in the Concurrent Validation are
described in the last section of this report, Part IV.

This report is supplemented by an ARI Research Note (in preparation),
which contains various technical papers prepared during FY85 in connection
with specific aspects of the Project A research activities. These papers are
listed in Appendix A of the present report.
PART II


PREDICTOR  DEVELOPMENT 


After discussion of the general issues involved in
Project A's predictor development efforts, the specific
development steps and initial pilot testing of each major
predictor type will be described. After all predictors
are discussed in turn, the full-scale field tests will be
described and the revisions to the instruments made on the
basis of the field tests will be outlined.


Section  1 

INTRODUCTION  TO  PREDICTOR  DEVELOPMENT1 


This section describes the development, initial pilot testing, and field
testing of the Trial Battery. The Trial Battery is the array of new enlisted
selection/classification tests that are being evaluated in the Concurrent
Validation sample. Again, the overall objective is to develop and validate
tests that supplement the Armed Services Vocational Aptitude Battery (ASVAB)
and broaden the domain of potential selection measures for U.S. Army
first-tour enlisted personnel.

Project A has adopted a construct-oriented strategy of predictor
development and endeavored to build a model of the predictor space by (a)
identifying the major domains of constructs, (b) selecting measures within
each domain that met a number of psychometric and pragmatic criteria, and (c)
specifying those constructs that appeared to be the "best bets" for
incrementing prediction of training/job performance and attrition/retention
in Army jobs.

Ideally, the model would lead to the selection of a finite set of
relatively independent predictor constructs that are also independent of
present predictors and maximally related to the criteria of interest. If
these conditions were met, then the resulting set of measures would yield
valid prediction within each job, yet possess enough heterogeneity to yield
valid classification of persons into different jobs.


Objective 

This  approach  led  to  the  delineation  of  a  set  of  more  concrete 
objectives: 

1.  Identify existing measures of human abilities, attributes, or
characteristics that are most likely to be effective in predicting
successful soldier performance and in classifying persons into MOS
where they will be most successful, with special emphasis on
attributes not tapped by current pre-enlistment measures.


1 Part II is based primarily on ARI Technical Report 739, Development and
Field Test of the Trial Battery for Project A, Norman Peterson, Editor, and a
supplementary ARI Research Note (in preparation), which contains the report
appendixes that present the tests used in the Pilot Trial Battery and Trial
Battery administration. Authors of various portions of this report include
Norman Peterson, Jody Toquam, Leaetta Hough, Janis Houston, Rodney Rosse,
Jeffrey McHenry, Teresa Russell, VyVy Corpe, Matthew McGue, Bruce Barge,
Marvin Dunnette, John Kamp, and Mary Ann Hanson.
2.  Where appropriate, design and develop new measures or modify
existing measures of these "best bet" predictors.

3.  Estimate and evaluate the reliability of the new pre-enlistment
measures and their vulnerability to motivational set differences,
faking, variances in administrative settings, and practice effects.

4.  Determine the interrelationships (or covariance) between the new
pre-enlistment measures and current pre-enlistment measures.

5.  Determine the degree to which the validity of new pre-enlistment
measures generalizes across Military Occupational Specialties
(MOS), that is, proves useful for predicting measures of successful
soldier performance across quite different MOS, and, conversely,
the degree to which the measures are useful for classification or
the differential prediction of success across MOS.

6.  Determine the extent to which new pre-enlistment measures increase
the accuracy of prediction of success and the accuracy of
classification into MOS over and above the levels of accuracy
reached by current pre-enlistment measures.


General  Research  Design  and  Organization 

To achieve these objectives, we have followed the design depicted in
Figure II.1. Several things are noteworthy about the 15 subtasks in the
research plan. First, five test batteries are mentioned: Preliminary
Battery, Demonstration Computer Battery, Pilot Trial Battery, Trial Battery,
and Experimental Battery. These appear in sequence, a schedule that allows
us to improve the predictors as data are gathered and analyzed on each
successive battery or set of measures. Second, a large-scale literature
review and an expert judgment procedure were utilized early in the project to
take maximum advantage of previous research and accumulated expert
knowledge. The expert judgments were used early on to develop an initial
model of both the predictor space and the criterion space, which also relied
heavily on the information gained from the literature review. Third, the
design includes both predictive and concurrent validation designs.

The project staff were organized into three "domain teams." One team
concerned itself with temperament, biographical, and vocational interest
variables and came to be called the "non-cognitive" team. Another team
examined cognitive and perceptual variables and was called the "cognitive"
team. The third team concentrated on psychomotor and perceptual variables
and was labeled the "psychomotor" team or sometimes the "computerized" team,
since all the measures developed by that team were computer-administered.

We turn now to a description of the initial research activities devoted
to development of new predictors, specifically: the literature review; expert
judgments; development, administration, and analysis of the Preliminary
Battery; and initial development of a computer battery. As Figure II.1
shows, all of these activities led up to the development of the Pilot Trial
Battery.




Figure II.1. Flow chart of predictor measure development activities of Project A.


Literature  Review 


The overriding purpose of the literature review was to gain maximum
benefit from earlier research on selection/classification measures that were
even remotely relevant for the jobs in the Project A job population.


Search  Procedures 

The search was conducted by the three research teams, each responsible
for one of the broadly defined areas of human abilities or characteristics
mentioned previously. These areas, or domains, proved to be convenient for
purposes of organizing and conducting literature search activities, but were
not used as (nor intended to be) a final taxonomy of possible predictor
measures.

The literature search was conducted in late 1982 and early 1983 (i.e.,
FY83). Within each of the three areas, the teams carried out essentially the
same steps:

1.  Compile  an  exhaustive  list  of  potentially  relevant  reports, 
articles,  books,  or  other  sources. 

2.  Review each source and determine its relevance for the project by
examining the title and abstract (or other brief review).

3.  Obtain  the  sources  Identified  as  relevant  in  the  second  step. 

4.  For  relevant  materials,  carry  out  a  thorough  review  and  transfer 
relevant  Information  onto  summary  forms  specifically  developed  for 
the  project. 

Within Step 1, several computerized searches of relevant data bases
were generated. Across all three ability areas, over 10,000 sources were
identified via the computer search. Many of these sources were identified as
relevant in more than one area, and were thus counted more than once.

In addition to the computerized searches, we solicited reference lists
from recognized experts in each of the areas, obtained several annotated
bibliographies from military research laboratories, and scanned the last
several years' editions of relevant research journals as well as more general
sources such as textbooks, handbooks, and appropriate chapters in the Annual
Review of Psychology.

The vast majority of the references identified in Step 1 were not
relevant to Project A and were eliminated in Step 2. The references
identified in Step 2 were obtained and reviewed, and two forms were completed
for each source: an Article Review form and a Predictor Review form (several
of the latter could be completed for each source). These forms were designed
to capture, in a standard format, the essential information, which varied
considerably in organization and reporting style in the original documents.

The Article Review form contained seven sections: citations, abstract,
list of predictors (keyed to the Predictor Review forms), description of
criterion measures, description of sample(s), description of methodology,
other results, and reviewer's comments. The Predictor Review form also
contained seven sections: description of predictor, reliability, norms/
descriptive statistics, correlations with other predictors, correlations with
criteria, adverse impact/differential validity/test fairness, and reviewer's
recommendations (about the usefulness of the predictor). Each predictor was
tentatively classified into an initial, working taxonomy of predictor
constructs (based primarily on the taxonomy described in Peterson and Bownas,
1982).


Literature  Search  Results 


The literature search was used in two major ways. First, three working
documents were written, one for each of the three areas: cognitive/
perceptual abilities, psychomotor/perceptual abilities, and non-cognitive
predictors (including temperament or personality, vocational interest, and
biographical data variables). These documents summarized the literature with
regard to critical issues, suggested the most appropriate organization or
taxonomy of the constructs in each area, and summarized the validities of the
various measures for different types of job performance criteria. Second,
the predictors identified in the review were subjected to further scrutiny to
(a) select tests and inventories to make up the Preliminary Battery, and (b)
select the "best bet" predictor constructs to be used in the "expert
judgment" research activity. We turn now to a description of that screening
process.


Screening  of  Predictors 

An initial list was compiled of all predictor measures that seemed even
remotely  appropriate  for  Army  selection  and  classification.  This  list  was 
further  screened  by  eliminating  measures  according  to  several  "knockout" 
factors:  (a)  measures  developed  for  a  single  research  project;  (b)  measures 
designed  for  a  narrowly  specified  population/occupational  group  (e.g., 
pharmacy  students);  (c)  measures  targeted  toward  younger  age  groups;  (d) 
measures  requiring  special  apparatus  for  administration;  (e)  measures  requir¬ 
ing  unusually  long  testing  times;  (f)  measures  requiring  difficult  or  subjec¬ 
tive  scoring;  and  (g)  measures  requiring  individual  administration. 

Knockout factor (d) was applicable only with regard to screening for the
Preliminary Battery, which could not have any computerized tests or other
apparatus since it was to be administered early in the project, before such
testing devices could be developed. Factor (d) was not applied with regard
to screening measures for inclusion in the expert judgment process.

The result of the application of knockout factors was a second list of
candidate measures. Each of these measures was evaluated, by at least two
researchers, on the 12 factors shown in Figure II.2. (A 5-point rating scale
was applied to each of the 12 factors.) Discrepancies in ratings were
resolved by discussion. There was not always sufficient information for a
variable to allow a rating on all factors.
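
The two-stage screen described above (knockout factors first, then independent ratings whose disagreements are flagged for discussion) can be sketched as follows. This is a minimal illustration, not the project's actual procedure or tooling: the factor labels, candidate names, and ratings are all hypothetical.

```python
# Hypothetical labels for the seven knockout factors (a)-(g) described above.
KNOCKOUT_FACTORS = {
    "single_research_project",    # (a)
    "narrow_population",          # (b)
    "younger_age_group",          # (c)
    "special_apparatus",          # (d)
    "long_testing_time",          # (e)
    "difficult_scoring",          # (f)
    "individual_administration",  # (g)
}


def passes_knockout(measure_flags, for_preliminary_battery):
    """Return True if a measure survives the knockout screen.

    Factor (d) is applied only when screening for the Preliminary Battery.
    """
    active = set(KNOCKOUT_FACTORS)
    if not for_preliminary_battery:
        active.discard("special_apparatus")
    return not (set(measure_flags) & active)


def rating_discrepancies(rater_a, rater_b):
    """List factors on which two raters gave different 1-5 ratings.

    A rating of None means there was too little information to rate.
    """
    flagged = []
    for factor, a in rater_a.items():
        b = rater_b.get(factor)
        if a is not None and b is not None and a != b:
            flagged.append(factor)
    return flagged


# Made-up candidates, each tagged with the knockout factors that apply.
candidates = {
    "paper_test": [],
    "apparatus_test": ["special_apparatus"],
    "pharmacy_inventory": ["narrow_population"],
}
prelim_pool = [n for n, f in candidates.items() if passes_knockout(f, True)]
expert_pool = [n for n, f in candidates.items() if passes_knockout(f, False)]

rater_a = {"reliability": 4, "discriminability": 3, "test_fairness": None}
rater_b = {"reliability": 4, "discriminability": 5, "test_fairness": 2}
to_discuss = rating_discrepancies(rater_a, rater_b)
```

In this sketch the apparatus-based measure is excluded from the Preliminary Battery pool but remains eligible for the expert judgment process, mirroring the selective use of factor (d).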



1.  Discriminability - extent to which the measure has sufficient score
range and variance, i.e., does not suffer from ceiling and floor
effects with respect to the applicant population.

2.  Reliability - degree of reliability as measured by traditional
psychometric methods such as test-retest, internal consistency, or
parallel forms reliability.

3.  Group Score Differences (Differential Impact) - extent to which there
are mean and variance differences in scores across groups defined by
age, sex, race, or ethnic group; a high score indicates little or no
mean differences across these groups.

4.  Consistency/Robustness of Administration and Scoring - extent to which
administration and scoring is standardized, ease of administration and
scoring, consistency of administration and scoring across
administrators and locations.

5.  Generality - extent to which predictor measures a fairly general or
broad ability or construct.

6.  Criterion-Related Validity - the level of correlation of the predictor
with measures of job performance, training performance, and
turnover/attrition.

7.  Construct Validity - the amount of evidence existing to support the
predictor as a measure of a distinct construct (correlational studies,
experimental studies, etc.).

8.  Face Validity/Applicant Acceptance - extent to which the appearance
and administration methods of the predictor enhance or detract from
its plausibility or acceptability to laymen as an appropriate test for
the Army.

9.  Differential Validity - existence of significantly different
criterion-related validity coefficients between groups of legal or
societal concern (race, sex, age); a high score indicates little or
no differences in validity for these groups.

10. Test Fairness - degree to which slopes, intercepts, and standard
errors of estimate differ across groups of legal or societal concern
(race, sex, age) when predictor scores are regressed on important
criteria (job performance, turnover, training); a high score indicates
fairness (little or no differences in slopes, intercepts, and standard
errors of estimate).

11. Usefulness for Classification - extent to which the measure or
predictor will be useful in classifying persons into different
specialties.

12. Overall Usefulness for Predicting Army Criteria - extent to which
predictor is likely to contribute to the overall or individual
prediction of criteria important to the Army (e.g., AWOL, drug use,
attrition, unsuitability, job performance, and training).


Figure II.2. Factors used to evaluate predictor measures for the
Preliminary Battery.
This second list of measures, each with a set of evaluations, was input
to (a) the final selection of measures for the Preliminary Battery and (b)
the final selection of constructs to be included in the expert judgment
process.


Expert  Forecasts  of  Predictor  Construct  Validities 

The procedure used in the expert judgment process was to (a) identify
criterion categories, (b) identify an exhaustive range of psychological
constructs that may be potentially valid predictors of those criterion
categories, and (c) obtain expert judgments about the relationships between
the two. Schmidt, Hunter, Croll, and McKenzie (1983) showed that pooled
expert judgments, obtained from experienced personnel psychologists, were as
accurate in estimating the validity of tests as actual, empirical
criterion-related validity research using samples of hundreds of subjects.
That is, experienced personnel psychologists are effective "validity
generalizers" for cognitive tests, although they do tend to underestimate
slightly the true validity as obtained from empirical research.

Consequently, one way to identify the "best bet" set of predictor
variables and measures is to use a formal judgment process employing experts,
such as that followed by Schmidt et al. Peterson and Bownas (1982) provide a
complete description of the methodology, which has been used successfully by
Bownas and Heckman (1976), Peterson, Houston, Bosshardt, and Dunnette (1977),
Peterson and Houston (1980), and Peterson, Houston, and Rosse (1984) to
identify predictors for the jobs of firefighter, correctional officer, and
entry-level occupations (clerical and technical), respectively. Descriptive
information about a set of predictors and the job performance criterion
variables is given to "experts" in personnel selection and classification,
typically personnel psychologists. These experts estimate the relationships
between predictor and criterion variables by rating or directly estimating
the value of the correlation coefficients.

The result is a matrix with predictor and criterion variables as the
columns and rows, respectively. Cell entries are experts' estimates of the
degree of relationship between the particular predictors and various
criteria. The interrater reliability of the experts' estimates is checked
first. If the estimates are sufficiently reliable (previous research shows
values in the .80 to .90 range for about 10 to 12 experts), the matrix of
predictor-criterion relationships can be analyzed and used in a variety of
ways. By correlating the columns of the matrix, the covariances of the
predictors can be estimated on the basis of the profiles of their estimated
relationships with the criteria. These covariances can then be factor
analyzed to identify clusters of predictors within which the measures are
expected to exhibit similar patterns of correlations with different
performance components. Similarly, the criterion covariances can be examined
to identify clusters of criteria predicted by a common set of predictors.
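
The matrix operations just described can be illustrated with a small Python sketch. Several things here are assumptions rather than the report's method: interrater reliability is estimated by stepping up the mean pairwise correlation among experts with the Spearman-Brown formula, the experts' matrices are pooled by simple averaging, and the three tiny matrices are made-up numbers. The factor analysis step is omitted.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pooled_reliability(expert_matrices):
    """Reliability of the pooled estimates (ASSUMED: Spearman-Brown step-up)."""
    flat = [[v for row in m for v in row] for m in expert_matrices]
    pairs = list(combinations(flat, 2))
    mean_r = sum(pearson(a, b) for a, b in pairs) / len(pairs)
    k = len(flat)
    return k * mean_r / (1 + (k - 1) * mean_r)


def mean_matrix(expert_matrices):
    """Average the experts' estimates cell by cell."""
    k = len(expert_matrices)
    n_rows, n_cols = len(expert_matrices[0]), len(expert_matrices[0][0])
    return [[sum(m[i][j] for m in expert_matrices) / k for j in range(n_cols)]
            for i in range(n_rows)]


def column_correlations(matrix):
    """Correlate predictor columns across the criterion rows."""
    cols = list(zip(*matrix))
    return [[pearson(a, b) for b in cols] for a in cols]


# Rows are criteria, columns are predictors; values are estimated validities.
expert_1 = [[0.5, 0.4, 0.1], [0.3, 0.2, 0.4], [0.1, 0.1, 0.5]]
expert_2 = [[0.4, 0.5, 0.2], [0.3, 0.3, 0.3], [0.2, 0.1, 0.4]]
expert_3 = [[0.5, 0.5, 0.1], [0.2, 0.2, 0.4], [0.1, 0.0, 0.5]]
experts = [expert_1, expert_2, expert_3]

reliability = pooled_reliability(experts)
predictor_corr = column_correlations(mean_matrix(experts))
```

In this toy example the first two predictors have nearly identical validity profiles across the criteria, so they correlate highly and would fall in the same cluster, while the third shows the opposite profile.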

Such procedures helped in identifying redundancies and overlap in the
predictor set. The clusters of predictors and of criteria are an important
product for a number of reasons. First, they provide an efficient and
organized means of summarizing the data generated by the experts. Second,
the summary form permits easier comparison with the results of meta-analyses
of empirical estimates of criterion-related validity coefficients. Third,
these clusters provide a model or theory of the predictor-criterion
performance space.


Method 

To carry out the expert judgments, a sample of subject matter experts
(SMEs) was selected, a universe of predictor variables and a universe of
criterion variables were identified, and materials that would allow the
experts to provide reliable estimates of criterion-related validity were
prepared.

Subjects. The experts were 35 industrial, measurement, or differential
psychologists with experience and knowledge in personnel selection research
and/or applications. Each expert was an employee of or consultant to one of
the four organizations involved in Project A: U.S. Army Research Institute,
Human Resources Research Organization, Personnel Decisions Research
Institute, and American Institutes for Research. Not all of the employees
were directly involved with Project A, although all of the consultants were.

Identification of Predictor Variables. The predictor variables
evaluated with regard to the 12 relevant factors (see Screening of
Predictors, above) were used in the expert judgment process. Variables were
included if they received generally high evaluations and if they added to the
comprehensiveness of coverage for a particular domain of predictor variables.
The names and definitions of these variables are shown in Appendix C of ARI
Technical Report 739 noted previously.

Materials describing each of the 53 variables were prepared. Each
packet contained a sheet that named and defined the variable, described how
it was typically measured, and provided a summary of the reliability and
validity of measures of the variable. Following this sheet were descriptions
which included the name of the test, its publisher, the variable it was
designed to measure, a description of the items and the number of items on
the test (in most cases, sample items were included), a brief description of
the administration and scoring of the test, and brief summaries of studies of
the reliability and validity of the measure.

Identification of Criterion Variables. Several types of criterion
variables were identified. The first type was a set of specific job task
categories. Short of enumerating all job tasks in the nearly 240 entry-level
MOS specialties, the nature of the performance domain had to be characterized
in a way that was at once comprehensive, understandable, and usable by judges.

The procedure used was based on more general job descriptions of a
representative sample of 111 jobs that had been previously clustered by job
experts as part of the MOS sample selection described in the introduction to
this volume. Criterion categories were developed by reviewing the
descriptions of the jobs in these clusters to determine common job
activities. Emphasis was placed on determining what a soldier in each job
might be observed doing and what he or she might be trying to accomplish;
the activities (e.g., transcribe, annotate, sort, index, file, retrieve)
lead to some common objective (e.g., record and file information).
Criterion categories often included reference to the use of equipment or
other objects.

Once criterion categories were identified for the common actions in the 23
clusters, additional categories were identified to cover unique aspects of
jobs in the sample of 111. In all, 53 categories were generated. Most of
these categories applied to several jobs, and most of the jobs were
characterized by activities from several categories.

The second type of criterion variable was a set that described performance
in initial Army training. Two sources of information were used to identify
appropriate training performance variables: archival records of soldiers'
performance in training, and interviews with trainers. This information was
obtained for eight MOS: Radio/Teletype Operator, MANPADS Crewman, Light
Vehicle/Power Generator Mechanic, Motor Transport Operator, Food Service
Specialist, M60 and M1 Armor Crewman, Administrative Specialist, and Unit
Supply Specialist. These specialties represented a heterogeneous group with
respect to type of work and were, for the most part, high-density MOS.

The review of archival records was intended to identify the type of
measures used to evaluate training performance, since the content was,
obviously, specific and unique to each MOS.

Five or six trainers were interviewed for each MOS. The format of the
interview was a modified "critical incidents" approach. Trainers were asked
"What things do trainees do that tell you they are good (or bad) trainees?"
Generally, trainers responded with fairly broad, trait-like answers, and
appropriate follow-up questions were used to obtain more specific
information oriented to behavior.

After the interviews were conducted and the archives examined, information
from both sources was pooled and categorized. Since the task or
MOS-specific performance variance was already covered elsewhere, four
variables were used to represent training performance. Their names and
definitions are shown in Appendix C in ARI Technical Report 739.

The final type of criterion variable was a set of general performance
categories developed as part of Task 4 work. Nine behavioral dimensions were
named and defined. In the final step, six more criterion variables were
added. The first two, "Survive in the field" and "Maintain physical
fitness," were added because they represent tasks that all soldiers are
expected to be able to perform but that did not emerge elsewhere. The last
four are all important "outcome" criterion variables; that is, they
represent outcomes of individual behavior that have negative or positive
value to the Army (e.g., disciplinary actions), but the outcomes could occur
because of a variety of individual behaviors.

In all, then, 72 possible criterion constructs were identified and defined
for use in the expert judgment task.

Instructions and Procedures. Detailed instructions were provided for each
judge along with the materials describing the predictor and criterion
variables. First, each judge was provided with information about the
concept of "true validity," that is, criterion-related validity corrected
for such artifacts as range restriction and unreliability, and unaffected by
variation in sample sizes. Judges were asked to make estimates of the level
of true validity on a 9-point scale. A rating of "1" meant a true validity
in the range of .00 to .10; "2," .11 to .20; and so forth, to "9," .81 to
.90.
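The rating scale can be written out as a small lookup. The following is an illustrative sketch only and is not part of the study materials; the function name is hypothetical, and the top band is assumed to be .81 to .90, following the pattern of the lower bands.

```python
def rating_to_validity_range(rating: int) -> tuple[float, float]:
    """Return the (low, high) true-validity band for a rating of 1 to 9.

    Assumes each band is one tenth wide: 1 -> .00-.10, 2 -> .11-.20, and so
    on up to 9 -> .81-.90 (the top band is an assumption from the pattern).
    """
    if not 1 <= rating <= 9:
        raise ValueError("rating must be an integer from 1 to 9")
    if rating == 1:
        return (0.00, 0.10)
    low = round((rating - 1) / 10 + 0.01, 2)   # e.g. rating 2 -> 0.11
    high = round(rating / 10, 2)               # e.g. rating 2 -> 0.20
    return (low, high)


if __name__ == "__main__":
    for r in range(1, 10):
        print(r, rating_to_validity_range(r))
```

A mean cell rating could be converted back to an approximate validity by taking the midpoint of the corresponding band, although the report does not say whether this was done.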

Second, descriptions of the 53 predictor variables were placed into three
groups, A, B, and C (two groups of 18 and one of 17). The 72 criterion
descriptions were in one group. Each rater was encouraged to skim the
materials for a few predictors and for all the criteria before beginning
the rating task.

Third, each judge estimated the validity of each predictor for each
criterion. The order of the predictor groups (A, B, C) was counterbalanced
across judges, so that about one-third of the 35 judges began with Group A
(Predictors 1-18), another one-third with Group B (Predictors 19-36), and
the rest with Group C (Predictors 37-53).
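The counterbalancing scheme can be sketched as follows. This is an illustration, not the procedure actually used: the report states only which group each judge began with, so the rotation of the remaining groups (A-B-C, B-C-A, C-A-B) and the round-robin assignment of judges are assumptions, and all names are hypothetical.

```python
from itertools import cycle

# Predictor numbers in each group, as described in the text.
GROUPS = {"A": range(1, 19), "B": range(19, 37), "C": range(37, 54)}


def assign_starting_groups(n_judges: int) -> list[str]:
    """Assign starting groups round-robin so each is used about equally."""
    starts = cycle("ABC")
    return [next(starts) for _ in range(n_judges)]


def rating_order(start: str) -> list[int]:
    """Return predictor numbers in rating order for a given starting group.

    Assumption: the remaining groups follow in rotated order.
    """
    labels = "ABC"
    i = labels.index(start)
    rotated = labels[i:] + labels[:i]
    return [p for g in rotated for p in GROUPS[g]]


if __name__ == "__main__":
    starts = assign_starting_groups(35)
    print({g: starts.count(g) for g in "ABC"})
    print(rating_order("B")[:3])
```

With 35 judges this gives 12, 12, and 11 judges per starting group, which matches the "about one-third" description.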

Ratings were made on separate Judgment Record Sheets. Before making any
judgments about a predictor, the expert was to read the descriptive
information and review the examples of items measuring it. Judgments were
to be made about the predictor as a construct, not about the variable as
measured by any specific measurement instrument. Judges were then to read
the description of the first criterion. The validities of the first
predictor variable were to be estimated for all 72 criteria before the judge
moved on to the next predictor.


All  judges  completed  the  task  during  the  first  week  of  October  1983. 


Results 


A number of analyses were carried out: reliability of the judgments, means
and standard deviations of the estimated validities within each
predictor/criterion cell and for various marginal values, and factor
analyses of the predictors (based on their validity profiles across the
criteria) and the criteria (based on their validity profiles across the
predictors).
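To make "validity profiles" concrete: each predictor has a row of mean estimated validities, one per criterion, and predictors with similar rows correlate highly and so tend to load on the same factor. The sketch below computes those profile correlations for invented data. The numbers, predictor names, and function names are hypothetical, and the factor extraction itself is not shown.

```python
from math import sqrt


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def profile_correlations(cell_means: dict[str, list[float]]) -> dict:
    """Correlate every pair of predictors across their criterion profiles."""
    names = list(cell_means)
    return {
        (a, b): pearson(cell_means[a], cell_means[b])
        for i, a in enumerate(names)
        for b in names[i + 1:]
    }


# Hypothetical mean ratings (1-9 scale) for three predictors on four criteria.
toy_cell_means = {
    "verbal": [6.0, 5.0, 2.0, 3.0],
    "reasoning": [5.5, 5.0, 2.5, 3.0],
    "dexterity": [2.0, 3.0, 6.0, 5.0],
}

if __name__ == "__main__":
    for pair, r in profile_correlations(toy_cell_means).items():
        print(pair, round(r, 2))
```

In the study the corresponding matrix would have 53 predictor rows and 72 criterion columns; transposing it gives the criterion profiles used for the second factor analysis.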


The estimated validities were highly reliable when averaged across raters.
The reliability of the mean estimated cell validities was .96. The factor
analyses were based on these cell means. The most pertinent analysis for
purposes of this report concerns the factor analysis of the predictors.
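The report does not say which reliability coefficient was computed. As a rough illustration, assuming the Spearman-Brown relation between single-rater and mean-rating reliability applies, a reliability of .96 for the mean of 35 judges would correspond to a single-judge reliability of about .41. The function names below are hypothetical.

```python
def spearman_brown(single_rater_r: float, k: int) -> float:
    """Reliability of the mean of k raters, given single-rater reliability."""
    return k * single_rater_r / (1 + (k - 1) * single_rater_r)


def implied_single_rater_r(mean_r: float, k: int) -> float:
    """Invert Spearman-Brown: single-rater reliability implied by mean_r."""
    return mean_r / (k - (k - 1) * mean_r)


if __name__ == "__main__":
    r1 = implied_single_rater_r(0.96, 35)
    print(round(r1, 3))                        # single-judge reliability
    print(round(spearman_brown(r1, 35), 2))    # recovers the pooled value
```

The point of the sketch is that modest agreement between individual judges is compatible with a highly reliable mean when many judges are pooled.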


Factor solutions with 2 through 24 factors were calculated; eigenvalues
diminished below 1.0 after 9 or 10 factors. No more than eight factors were
interpretable, so the eight-factor solution was selected as most reasonable.
The eight interpretable factors were named: I, Cognitive Abilities; II,
Visualization/Spatial; III, Information Processing; IV, Mechanical; V,
Psychomotor; VI, Social Skills; VII, Vigor; VIII, Motivation/Stability.


These eight factors appeared to be composed of 21 clusters, based on the
profile of loadings of each predictor variable across factors. This
hierarchical structure of the predictor variables is shown in Figure II.3.
Inspection of the profile clarifies the meanings of both the factors and the
clusters, as follows.

The eight predictor factors divide the predictor domain into
reasonable-appearing parts. The first five refer to abilities and skills in
the cognitive, perceptual, and psychomotor areas while the last three refer
to traits or predispositions in the non-cognitive area. Most of the
representative measures of the constructs defining the first five factors
are of maximal performance while most of the representative measures of the
last three factors are of typical performance, with the exception of the
interest variables.

The first four factors, which include 11 clusters of 29 predictor
constructs or variables, are cognitive-perceptual in nature. The first
factor, labeled Cognitive Abilities, includes seven clusters, five of which
appear to consist of more traditional mental test variables: Verbal
Ability/General Intelligence, Reasoning, Number Ability, Memory, and
Closure. The Perceptual Speed and Accuracy cluster is linked to measures
having a long history of inclusion in traditional mental tests. The seventh
cluster, Investigative Interests, refers to no cognitive test at all but
does tap interest in things intellectual, the abilities for which are
evaluated in this factor.

The second factor, Visualization/Spatial, consists of only one cluster but
includes six constructs which have some history of measuring spatial
ability. Two of the clusters from the Cognitive Abilities factor, Reasoning
and Closure, have some affinity to this second factor, as may be seen in the
factor analysis data. This may be due to the tasks used to illustrate the
assessment of the constructs, which are to solve problems of a visual and
nonverbal nature.

The third factor, Information Processing, also consists of only one
cluster, with the three constructs referring more directly to
cognitive-perceptual functioning rather than accumulated knowledge and/or
structure.

The fourth factor, Mechanical, includes two clusters, one of which consists
only of the construct of Mechanical Comprehension while the other is, again,
an interest cluster consisting of a positive loading for Realistic Interests
and a negative loading for Artistic Interests.

The fifth factor, Psychomotor, consists of three clusters which include the
nine psychomotor constructs. The first cluster, Steadiness/Precision,
refers to aiming and tracking tasks, where the target may move steadily or
erratically. The second cluster, Coordination, indexes the large-scale
complexity of the response required in a psychomotor task while the third
cluster, Dexterity, appears to index the small-scale complexity of
responses.

The remaining three factors, non-cognitive in character, refer more to
interpersonal activities. The Social Skills factor consists of two clusters.
The first, Sociability, refers to a general interest in people while the
second, Enterprising Interests, refers to a more specific interest in
working successfully with people. The seventh factor is called Vigor, as it
includes two clusters that refer to general activity level. The first,
Athletic Abilities/Energy, includes two constructs which point toward a
physical perspective while the second, Dominance/Self-Esteem, points toward
a psychological perspective. The eighth and last factor,
Motivation/Stability, includes three clusters or facets. The first,
Traditional Values, includes both temperament measures and interest scales,
and refers to being rule-abiding and a good citizen. The second cluster,
Work Orientation, refers to temperament measures which index attitudes
toward the individual vis-a-vis his or her efforts in the world. The third
cluster, Cooperation/Stability, appears to refer to skill in getting along
with people, including getting along with oneself in a healthy manner.

FACTORS
    CLUSTERS
        CONSTRUCTS

COGNITIVE ABILITIES
    A. Verbal Ability/General Intelligence
         1. Verbal Comprehension
         5. Reading Comprehension
        16. Ideational Fluency
        18. Analogical Reasoning
        21. Omnibus Intelligence/Aptitude
        22. Word Fluency
    B. Reasoning
         4. Word Problems
         8. Inductive Reasoning: Concept Formation
        10. Deductive Logic
    C. Number Ability
         2. Numerical Computation
         3. Use of Formula/Number Problems
    H. Perceptual Speed and Accuracy
        12. Perceptual Speed and Accuracy
    U. Investigative Interests
        49. Investigative Interests
    J. Memory
        14. Rote Memory
        17. Follow Directions
    F. Closure
        19. Figural Reasoning
        23. Verbal and Figural Closure

VISUALIZATION/SPATIAL
    E. Visualization/Spatial
         6. Two-dimensional Mental Rotation
         7. Three-dimensional Mental Rotation
         9. Spatial Visualization
        11. Field Dependence (Negative)
        15. Place Memory (Visual Memory)
        20. Spatial Scanning

INFORMATION PROCESSING
    G. Mental Information Processing
        24. Processing Efficiency
        25. Selective Attention
        26. Time Sharing

MECHANICAL
    L. Mechanical Comprehension
        13. Mechanical Comprehension
    M. Realistic vs. Artistic Interests
        48. Realistic Interests
        51. Artistic Interests (Negative)

PSYCHOMOTOR
    I. Steadiness/Precision
        28. Control Precision
        29. Rate Control
        32. Arm-hand Steadiness
        34. Aiming
    D. Coordination
        27. Multilimb Coordination
        35. Speed of Arm Movement
    K. Dexterity
        30. Manual Dexterity
        31. Finger Dexterity
        33. Wrist-finger Speed

SOCIAL SKILLS
    Q. Sociability
        39. Sociability
        52. Social Interests
    R. Enterprising Interests
        50. Enterprising Interests

VIGOR
    T. Athletic Abilities/Energy
        36. Involvement in Athletics and Physical Conditioning
        37. Energy Level
    S. Dominance/Self-esteem
        41. Dominance
        42. Self-esteem

MOTIVATION/STABILITY
    N. Traditional Values/Conventionality/Non-delinquency
        40. Traditional Values
        43. Conscientiousness
        46. Non-delinquency
        53. Conventional Interests
    O. Work Orientation/Locus of Control
        44. Locus of Control
        47. Work Orientation
    P. Cooperation/Emotional Stability
        38. Cooperativeness
        45. Emotional Stability

Figure II.3. Hierarchical map of predictor space.

The expert judgment task thus resulted in a hierarchical model of predictor
space that served as a guide for the development of new pre-enlistment
measures (the Pilot Trial Battery) for Army enlisted ranks. (Wing, Peterson,
& Hoffman, 1984, provide a detailed presentation of the expert judgment
process and results.) However, this model was not the only information that
guided the development of the Pilot Trial Battery, and we turn now to the
other major source of guidance, our experience in the development of the
Preliminary Battery.


Development  and  Administration  of  the  Preliminary  Battery 

The Preliminary Battery (PB) was a set of proven "off-the-shelf" measures
intended to overlap very little with the Army's current pre-enlistment
predictors. The collection of data on a number of predictors that represent
types of predictors not currently in use by the Army would allow an early
determination of the extent to which such predictors contributed unique
variance. Also, the collection of predictor data (from soldiers in training)
early in the project allowed an assessment of predictive validity much
earlier than if we waited until the Trial Battery was developed (see Figure
II.1). Some of the Preliminary Battery measures were also included in the
pilot tests of the Trial Battery as marker variables.


Selection  of  Preliminary  Battery  Measures 

As described earlier, the literature review identified a large set of
predictor measures, each with ratings by the researchers on 12 psychometric
and substantive evaluation factors (see Figure II.2). These ratings were
used to select a smaller set of measures as serious candidates for inclusion
in the Preliminary Battery. Two major practical constraints came into play:
(a) no apparatus or individualized testing methods could be used because the
time available to prepare for battery administration was relatively short,
and because the battery would be administered to a large number of soldiers
(several thousand) over a 9-month period by relatively unsophisticated test
administrators; and (b) only 4 hours were available for testing.

The research staff made an initial selection of "off-the-shelf" measures,
but there were still too many measures for the time available. The
preliminary list was presented to a joint meeting of the ARI and consortium
research staffs, and the available information about each measure was
presented and discussed. A final set of measures was selected at this
meeting, subject to review by several external (to Project A) consultants
who had been retained for their expertise in various predictor domains.
Subsequently, these experts reviewed the selected measures and made several
"fine-tuning" suggestions.

The Preliminary Battery included the following:

•  Eight perceptual-cognitive measures

   -  Five from the Educational Testing Service (ETS) French Kit
      (Ekstrom, French, & Harman, 1976)

   -  Two from the Employee Aptitude Survey (EAS) (Ruch & Ruch, 1980)

   -  One from the Flanagan Industrial Tests (FIT) (Flanagan, 1965)

•  Eighteen scales from the Air Force Vocational Interest Career
   Examination (VOICE) (Alley & Matthews, 1982)

•  Five temperament scales adapted from published scales

   -  Two from the Differential Personality Questionnaire (DPQ)
      (Tellegen, 1982)

   -  One from the California Psychological Inventory (CPI) (Gough, 1975)

   -  The Rotter I/E scale (Rotter, 1966)

   -  Validity scales from both the DPQ and the Personality Research
      Form (PRF) (Jackson, 1967)

•  Owens' Biographical Questionnaire (BQ) (Owens & Schoenfeldt, 1979)

   The BQ could be scored for either 11 scales for males or 14 for
   females based on Owens' research, or for 18 predesignated,
   combined-sex scales developed for this research and called
   Rational Scales. The Rational Scales had no item on more than one
   scale, unlike some of Owens' scales. Items tapping religious or
   socio-economic status were deleted from Owens' instrument for this
   use, and items tapping physical fitness and vocational-technical
   course work were added.

Appendix D in ARI Technical Report 739 shows all the scale names and number
of items for the Preliminary Battery.

In addition to the Preliminary Battery, scores were available for the Armed
Services Vocational Aptitude Battery, which all soldiers take prior to entry
into service. ASVAB's 10 subtests are named below, with the test acronym and
number of items in parentheses:

    Word Knowledge (WK:35), Paragraph Comprehension (PC:15),
    Arithmetic Reasoning (AR:20), Numerical Operations (NO:50),
    General Science (GS:25), Mechanical Comprehension (MC:25),
    Math Knowledge (MK:25), Electronics Information (EI:20),
    Coding Speed (CS:84), Auto-Shop Information (AS:25).

All are considered to be power tests except for NO and CS, which are
speeded. Prior research (in Kass, Mitchell, Grafton, & Wing, 1983) has shown
the reliability of the subtests to be within expectable limits for cognitive
tests of this length (i.e., .78 - .92).


Sample  and  Procedure 


The Preliminary Battery was administered to soldiers entering Advanced
Individual Training (AIT) for four MOS: 05C, Radio Teletype Operator (MOS
code was later changed to 31C); 19E/K, Tank Crewman; 63B, Vehicle and
Generator Mechanic; and 71L, Administrative Specialist. Almost all soldiers
entering AIT for these MOS during the period 1 October 1983 to 30 June 1984
completed the Preliminary Battery. We are here concerned only with the
sample of soldiers who completed the battery between 1 October 1983 and
1 December 1983, approximately 2,200 soldiers.

The battery was administered at five training posts by civilian or military
staff already employed on site. Task 2 staff traveled to these sites to
deliver battery administration manuals and to train the persons who would
administer the battery. Before its implementation, the Preliminary Battery
was administered to a sample of 40 soldiers at Fort Leonard Wood to test the
instructions, timing, and other administration procedures. The results of
this tryout were used to adjust the procedures, prepare the manual, and
identify topics to be emphasized during administrator training.


Analyses 

An initial set of analyses was performed to inform the development of the
Pilot Trial Battery, and we summarize those findings here. They are more
completely reported in Hough, Dunnette, Wing, Houston, and Peterson (1984).

Three types of analyses were done. First, the psychometric characteristics
of each scale were explored. These analyses included descriptive statistics,
item analyses (including numbers of items attempted in the time allowed),
internal consistency reliability estimates, and, for the temperament
inventory, percentage of subjects failing the scales intended to detect
random or improbable response patterns. Second, the covariances of the
scales within the various conceptual domains (i.e., cognitive, temperament,
biographical data, and vocational interest) and across these domains were
investigated. Third, the covariances of the Preliminary Battery scales with
ASVAB measures were investigated to identify any PB constructs that showed
excessive redundancy with ASVAB constructs.

The psychometric analyses showed some problems with the cognitive tests.
The time limits appeared too stringent for some tests, and one test, Hidden
Figures, appeared to be much too difficult for the population being tested.
The lesson learned was that the Pilot Trial Battery measures should be more
accurately targeted (in terms of difficulty of items and time limits) toward
the population of persons seeking entry into the U.S. Army.

No serious problems were unearthed with regard to the temperament, biodata,
and interest scales. Item-total correlations were acceptably high and in
accordance with prior findings, and score distributions were not excessively
skewed or different from expectation. About 8% of the respondents failed the
scale that screened for inattentive or random responding on the temperament
inventory, a figure that is in accord with findings in other selection
research.

Covariance analyses showed that vocational interest scales were relatively
distinct from the biographical and temperament scales, but the latter two
types of scales showed considerable covariance. Five factors were identified
from the 40 non-cognitive scales, two that were primarily vocational
interests and three that were combinations of biographical data and
temperament scales. These findings led us to consider, for the Pilot Trial
Battery, combining biographical and temperament item types to measure the
constructs in these two areas. The five non-cognitive factors showed
relative independence from the cognitive PB tests, with the median absolute
correlations of the scales within each of the five factors with each of the
eight PB cognitive tests ranging from .01 to .21. This confirmed our
expectations of little or no overlap between the cognitive and non-cognitive
constructs.
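The summary statistic used here, a median of absolute correlations, can be computed as in the sketch below. The correlations shown are invented for illustration and the function name is hypothetical; in the study, one such median would be computed for each pairing of a non-cognitive factor with a cognitive test.

```python
from statistics import median


def median_abs_correlation(correlations: list[float]) -> float:
    """Median of |r| over the scales that belong to one factor."""
    return median(abs(r) for r in correlations)


# Hypothetical correlations of five scales in one non-cognitive factor
# with a single cognitive test.
toy_correlations = [-0.12, 0.05, 0.21, -0.01, 0.08]

if __name__ == "__main__":
    print(median_abs_correlation(toy_correlations))
```

Taking absolute values first keeps positive and negative correlations from cancelling, so a median near zero indicates little overlap in either direction.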

Correlations and factor analysis of the 10 ASVAB subtests and the eight PB
cognitive tests confirmed prior analyses of the ASVAB (Kass et al., 1983)
and the relative independence of the PB tests. Although some of the ASVAB-PB
test correlations were fairly high (the highest was .57), most were less
than .30 (49 of the 80 correlations were .30 or less, 65 were .40 or less).
The factor analysis (principal factors extraction, varimax rotation) of the
18 tests showed all eight PB cognitive tests loading highest on a single
factor.
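The tally reported above (10 ASVAB subtests by 8 PB tests gives 80 cross-correlations, counted against cutoffs of .30 and .40) is simple to reproduce once the correlations are in hand. The following sketch uses invented values and a hypothetical function name.

```python
def tally_at_or_below(correlations, thresholds=(0.30, 0.40)):
    """Count how many correlations fall at or below each threshold."""
    return {t: sum(1 for r in correlations if r <= t) for t in thresholds}


if __name__ == "__main__":
    # Hypothetical cross-correlations; the study had 10 x 8 = 80 of them.
    toy = [0.10, 0.30, 0.35, 0.57]
    print(tally_at_or_below(toy))
```

The counts are cumulative, as in the text: every correlation at or below .30 is also counted at or below .40.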

The non-cognitive scales overlapped very little with the four ASVAB factors
identified in the factor analysis of the ASVAB subtests and PB cognitive
tests. Median correlations of non-cognitive scales with the ASVAB factors,
computed within the five non-cognitive factors, ranged from .03 to .32, but
14 of the 20 median correlations were .10 or less.

The experience in training battery administrators and monitoring the
administration over the 9-month period provided useful information for
collecting data later with the Pilot Trial Battery and Trial Battery.


Initial  Computer-Administered  Battery  Development 

Because computerized testing was a new area of test development, the
initial phase is given special attention here. The measures are described in
more detail in a later subsection. There were four phases of activities:
(a) information gathering about past and current research in the area of
perceptual/psychomotor measurement and computerized methods of testing such
abilities; (b) construction of a demonstration computer battery;
(c) selection of commercially available microprocessors and peripheral
devices, writing of software for testing several abilities using this
hardware, and tryout of this hardware and software; and (d) continued
development of software, and the design and construction of a custom-made
peripheral device, which is now called a response pedestal.


Compared to the paper-and-pencil measurement of cognitive abilities and the
major non-cognitive variables, computerized measurement of psychomotor and
perceptual abilities was in a relatively primitive state. Much work had been
done in World War II using electro-mechanical apparatus, but relatively
little work had occurred since then. Microprocessor technology held out the
promise of improving measurement in this area, but the work was (and still
is) in its early stages.


Phase  1:  Information  Gathering 

While almost no literature was available on computer-administered
(especially microprocessor-driven) testing of psychomotor/perceptual
abilities for selection/classification purposes, there was considerable
literature available on the taxonomy or structure of such abilities, based
primarily on work done in World War II or shortly thereafter. Also, work
from this era showed that testing such abilities with electro-mechanical
apparatus did produce useful levels of validity for such jobs as aircraft
pilot, but that such apparatus experienced reliability problems.

To obtain the most current information, in the spring of 1983 we visited
four military laboratories engaged in relevant research. The four sites
visited were the Air Force Human Resources Laboratory, Brooks Air Force
Base; the Naval Aerospace Medical Research Laboratory, Pensacola Naval
Station; and the Army Research Institute Field Units at Fort Knox, Kentucky,
and Fort Rucker, Alabama. During these site visits we gathered much
information, but focused primarily on the answers to five questions:

1.  What computerized measures are actually in use?

    Over 60 different measures were found across the four sites. A
    sizable number of these were specialized simulators that were not
    relevant for Project A (e.g., a helicopter simulator weighing
    several tons that is permanently mounted in an air-conditioned
    building). However, there were many measures in the perceptual,
    cognitive, and psychomotor areas that were relevant.

2.  What computers were selected for use?

3.  What computer languages are being used?

    Three different microprocessors (Apple, Terak, and PDP 11) and
    three different computer languages (PASCAL, BASIC, and FORTRAN)
    appeared to account for most of the activity. However, there
    appeared to be relatively little in common among the four sites.

4.  How reliable are these computerized measures?

5.  What criterion-related validity evidence exists for these measures
    so far?

    Data were being collected at all four sites to address the
    reliability and criterion-related validity questions, but very
    little documented information was available. This was not
    surprising in light of the fact that most of the measures had been
    developed only very recently.

Despite the lack of evidence on reliability and validity, we did learn some
valuable lessons. First, large-scale testing could be carried out on
microprocessor equipment (AFHRL was doing so). Second, a variety of software
and hardware could produce satisfactory results. Third, it would be highly
desirable to have the testing devices or apparatus be as compact and simple
in design as possible to minimize "down time" and make transportation
feasible. Fourth, it would be highly desirable to develop our software and
hardware devices to be as completely self-administering (i.e., little or no
input required from test monitors) as possible and as impervious as possible
to prior experience with typewriting and playing video games.


Phase  2:  Demonstration  Battery 


After these site visits, a short demonstration battery was programmed on
the Osborne 1, a portable microprocessor. This short battery was
self-administering, recorded time-to-answer and the answer made, and
contained five tests: simple reaction time, choice reaction time, perceptual
speed and accuracy (comparing two alphanumeric phrases for similarity),
verbal comprehension, and a self-rating form (indicating which of two
adjectives "best" describes the examinee, on a relative 7-point scale). We
also experimented with the programming of several types of visual tracking
tests, but did not include these in the self-administered demonstration
battery.
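
The two quantities the demonstration battery recorded for each item, the
answer and the time taken to give it, can be illustrated with a minimal
modern sketch. This is not the original BASIC program; the names Trial and
run_trial, and the use of Python's time.perf_counter as the clock, are
assumptions made only for illustration.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class Trial:
    answer: str        # the response the examinee gave
    latency_s: float   # time-to-answer, in seconds


def run_trial(present: Callable[[], None],
              get_answer: Callable[[], str],
              clock: Callable[[], float] = time.perf_counter) -> Trial:
    """Present one stimulus, then record the answer and the time-to-answer.

    `present` and `get_answer` are hypothetical hooks standing in for the
    display and keyboard routines; `clock` is injectable so the timing
    logic can be checked without real hardware.
    """
    present()
    start = clock()
    answer = get_answer()
    elapsed = clock() - start
    # Round to hundredths of a second, the resolution cited for the later battery.
    return Trial(answer=answer, latency_s=round(elapsed, 2))
```

Passing the clock in as a parameter keeps the timing code separate from the
display and input code, which is the same separation the later PASCAL and
assembly-language design relied on.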


No data were collected, but experience in developing and using the
battery  convinced  us  that  BASIC  did  not  allow  enough  power  and  control  of 
timing  to  be  useful  for  our  purposes.  The  basic  methods  for  controlling 
stimulus  presentation  and  response  acquisition  through  a  keyboard  were 
thoroughly  explored.  Techniques  for  developing  a  self-administering  battery 
of  tests  were  tried  out. 


The second activity during this phase was consultation with three experts
at the University of Illinois (Charles Hulin, John Adams, and Phillip
Ackerman) about perceptual/psychomotor abilities and their measurement.
The major points were:

•  The results obtained in World War II using electro-mechanical,
psychomotor testing apparatus probably do generalize to the present
era in terms of the structure of abilities and the usefulness of
such abilities for predicting job performance in jobs like aircraft
pilot.

•  The taxonomy of psychomotor skills and abilities probably should be
viewed in a hierarchical fashion, and perhaps Project A's development
efforts would be best focused on two or three relatively high-level
abilities such as gross motor coordination, multilimb
constant processing tasks, and fine manipulative dexterity.

•  Rate of learning or practice effects were viewed as a major concern.
If later test performance was more valid than early test
performance, or if early test performance was not valid at all and
later test performance was, then it was unlikely that psychomotor
testing would be practically feasible in the operational military
selection environment. There were, however, no empirically based
answers to these questions, and it was acknowledged that research
was necessary to answer them.


Phase 3: Selection and Purchase of Microprocessors and Development and
Tryout of Software

On the basis of information from the first two phases, we defined the
desirable characteristics of a microprocessor useful for Project A. Desired
characteristics, as outlined in the fall of 1983, were:

1.  Reliability: The machine should be manufactured and maintained by a
company that has a proven record, and the machine itself should be
capable of being moved from place to place without breaking down.

2.  Portability: The computer must be easily moved between posts during
development efforts.

3.  Most Recent Generation of Machine: Progress is very rapid in this
area; therefore, we should get the latest "proven" type of machine.

4.  Compatibility: Although extremely difficult to achieve, a desirable
goal is to use a machine that is maximally compatible with other
machines.

5.  Appropriate Display Size, Memory Size, Disk Drives, Graphics, and
Peripheral Capabilities: We need a video display that is at least 9
inches (diagonally), but it need not be color. Since we will be
developing experimental software, we need a relatively large amount
of random access memory. Also, we require two floppy disk drives to
store needed software and to record subjects' responses. High-
resolution graphics capability is desirable for some of the kinds
of tests. Finally, since several of the ability measurement
processes will require the use of paddles, joysticks, or other
similar devices, the machine must have appropriate hardware and
software to allow such peripherals.

In the end we selected the Compaq portable microprocessor with 256K RAM,
two 320K disk drives, a "game board" for accepting input from peripheral
devices such as joysticks, and software for FORTRAN, PASCAL, BASIC, and
assembly language programming. Six of these machines were purchased in
December 1983. We also purchased six commercially available, dual-axis
joysticks.

We chose to prepare the bulk of the software using PASCAL as implemented
by Microsoft, Inc. PASCAL software is implemented using a compiler that
permits modularized software development, it is relatively easy for others
to read, and it can be implemented on a variety of computers.

Some processes, mostly those that are specific to the hardware
configuration, had to be written in IBM-PC assembly language. Examples
include interpretation of the peripheral device inputs, reading of the
real-time-clock registers, calibrated timing loops, and specialized graphics
and screen manipulation routines. For each of these identified functions, a
PASCAL-callable "primitive" routine with a unitary purpose was written in
assembly language. Although the machine-specific code would be useless on a
different type of machine, the functions were sufficiently simple and
unitary in purpose that they could be reproduced with relative ease.


The overall strategy of the software development was to take advantage
of each researcher's input as directly as possible. It quickly became clear
that the direct programming of every item in every test by one person (a
programmer) was not going to be successful in terms of either time
constraints or quality of product. To make it possible for each researcher
to contribute her or his judgment and effort to the project, it was
necessary to take the "programmer" out of the step between conception and
product as much as possible.

The testing software modules were designed as "command processors" which
interpreted relatively simple and problem-oriented commands. These were
organized in ordinary text written by the various researchers using word
processors. Many of the commands were common across all tests. For
instance, there were commands that permitted writing specified text to
"windows" on the screen and controlling the screen attributes (brightness,
background shade, etc.). A command could hold a display on the screen for a
period of time measured to 1/100th-second accuracy. There were commands that
caused the programs to wait for the respondent to push a particular button.
Other commands caused the cursor to disappear or the screen to go blank
during the construction of a complex display.

Some of the commands were specific to particular item types. These
commands were selected and programmed according to the needs of a particular
test type. For each item type, we decided upon the relevant stimulus
properties to vary and built a command that would allow the item writer to
quickly construct a set of commands for items which he or she could then
inspect on the screen.

These  techniques  made  it  possible  for  entire  tests  to  be  constructed  and 
experimentally  manipulated  by  psychologists  who  could  not  program  a  computer. 
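
The "command processor" idea can be sketched as a small interpreter that
reads plain-text commands and dispatches each one to a device routine. The
report does not give the original command syntax, so the command names used
here (WRITE, HOLD, WAITKEY, BLANK), the script layout, and the LoggingDevice
stand-in are all hypothetical, and the sketch is in Python rather than the
PASCAL used on the project.

```python
from typing import Callable, Dict, List, Tuple


class LoggingDevice:
    """Stand-in for the screen and response hardware; it only records requests."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def write(self, window: int, text: str) -> None:
        self.log.append(f"write[{window}]: {text}")

    def hold(self, seconds: float) -> None:
        self.log.append(f"hold {seconds:.2f}s")

    def wait_button(self, button: str) -> None:
        self.log.append(f"wait for {button}")

    def blank(self) -> None:
        self.log.append("blank")


def parse_script(text: str) -> List[Tuple[str, str]]:
    """Turn a researcher-written text script into (command, argument) pairs."""
    commands = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):   # skip blank lines and comments
            continue
        name, _, rest = line.partition(" ")
        commands.append((name.upper(), rest.strip()))
    return commands


def run_script(text: str, device: LoggingDevice) -> None:
    """Interpret each command by dispatching to the matching device routine."""

    def do_write(arg: str) -> None:
        window, _, message = arg.partition(" ")
        device.write(int(window), message.strip())

    handlers: Dict[str, Callable[[str], None]] = {
        "WRITE": do_write,
        "HOLD": lambda arg: device.hold(float(arg)),
        "WAITKEY": lambda arg: device.wait_button(arg),
        "BLANK": lambda arg: device.blank(),
    }
    for name, arg in parse_script(text):
        if name not in handlers:
            raise ValueError(f"unknown command: {name}")
        handlers[name](arg)
```

With this arrangement an item writer edits only the text script; supporting
a new item type means adding one handler rather than reprogramming each
item.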

As this software was written, we used it to administer the computerized
tests to small groups of soldiers (N = 5 or fewer) at the Minneapolis
Military Entrance Processing Station (MEPS). The soldiers completed the
battery without assistance from the researchers, unless help was absolutely
necessary, and were then questioned. The nature of the questions varied over
the progress of these developmental tryouts, but mainly dealt with clarity
of instructions, difficulty of tests or test items, screen brightness
problems, difficulties in using keyboard or joysticks, clarity of visual
displays, and their general (favorable/unfavorable) reaction to this type of
testing.

These tryouts were held from 20 January 1984 through 1 March 1984, and a
total of 42 persons participated in nine separate sessions. The feedback
received from the participants was extremely useful in determining the shape
of the test, prior to the first pilot test of the Pilot Trial Battery.

Phase 4: Continued Software Development and Design/Construction of a
Response Pedestal

By the end of Phase 3, we had developed a self-administering,
computerized test battery that was implemented on a Compaq portable
computer. The subjects responded on the normal keyboard for all tests except
for a tracking test which required them to use a joystick, a commercially
available device normally used for video games. Seven different tests had
been programmed for the battery.

During the fourth phase of development, several significant events
occurred. We made field observations of some combat MOS to obtain
information for further development of computerized tests; the first pilot
test of the computerized battery was completed; we designed and constructed
a custom-made response pedestal for the computerized battery; and a formal
review of progress was conducted.

The primary result of the review was the identification and
priority-setting of the ability constructs for which computerized tests
should be developed. A second result was a decision to go to the field to
observe several combat arms MOS to target the tests more closely to those
skills.

These field observations subsequently took place at several posts. In
addition to observing soldiers in the field, we operated various training
aids and simulators that were available during our visits. The MOS for which
we were able to complete these observations were 11B (Infantryman), 13B
(Cannon Crewman), 19E (Tank Crewman), 16S (MANPADS Crewman), and 05C
(Radio/Teletype Operator).

The first pilot test of the Pilot Trial Battery occurred at Fort Carson
during this phase. (See Section 2 for a description of the sample and
procedures of that pilot test.) With regard to the computerized tests, the
same procedures were used as for the MEPS tryouts in Phase 3.

A total of 20 soldiers completed the computerized battery. The
information obtained at this pilot test primarily confirmed a major concern
that had surfaced during the MEPS tryouts, namely, the undesirability of the
computer keyboard and commercially available joysticks for acquiring
responses. Feedback from subjects (and observation of their test taking)
indicated that it was difficult to pick out one or two keys on the keyboard,
and that rather elaborate, and therefore confusing, instructions were needed
to use the keyboard in this manner. Even with such instructions, subjects
frequently missed the appropriate key, or inadvertently pressed keys because
they were leaving their fingers on the keys in order to retain the
appropriate position for response. Also, there was variability in the way
subjects prepared for test items, and more or less random positioning of
their hands added unwanted (error) variance to their scores. Similar issues
arose with regard to the joysticks, but the main problems were their lack of
durability and the large variance in their operating characteristics.

After consultation with ARI and other Project A researchers, Task 2
staff decided to develop a custom-made response pedestal to alleviate these
problems as much as possible. Accordingly, we drew up a rough design for
such a pedestal and contracted with an engineering firm to fabricate a
prototype. We tried out the first prototype, suggested modifications, and
had six copies produced in time for the Fort Lewis pilot test in June 1984.

Finally, we wrote additional software to test the abilities that had
been chosen for inclusion in the Pilot Trial Battery and to accommodate the
new response pedestal.


Identification  of  Pilot  Trial  Battery  Measures 

In March 1984, a formal In Progress Review (IPR) meeting was held to
decide on the measures to be developed for the Pilot Trial Battery.
Information from the literature review, expert judgments, initial analyses
of the Preliminary Battery, and the first three phases of computer battery
development was presented and discussed. Task 2 staff made recommendations
for inclusion of measures, and these were evaluated and revised. Figure II.4
shows the results of that deliberation process.

This set of recommendations constitutes the initial array of predictor
variables for which measures would be constructed and then submitted to a
series of pilot tests and field tests, with revisions being made after each
phase. The specific measures, the steps in their construction, and their
final form after pilot and field testing are described in later sections of
Part II.


Pilot  Tests  and  Field  Tests  of  the  Pilot  Trial  Battery 

There were three pilot tests of the measures developed for the Pilot
Trial Battery. These took place at Fort Carson in April 1984, Fort Campbell
in May 1984, and Fort Lewis in June 1984. At the first two sites not all
Pilot Trial Battery measures were administered, but the complete battery was
administered at Fort Lewis. Sections 2, 3, 4, and 5 describe these pilot
tests, resulting analyses, and revisions to measures prior to the field
tests. The reports of data analyses emphasize the Fort Lewis administration
since it was the first time the complete battery was administered and
provided the largest pilot test sample. (The pilot tests are sometimes
referred to as "tryouts" in the remainder of this report.)

There were three field tests of the Pilot Trial Battery. These occurred
at Fort Knox, Fort Bragg, and the Minneapolis MEPS in fall 1984. These
field tests, as well as the resulting revisions of the Pilot Trial Battery,
are described in Section 6.


Final       Predictor
Priority*   Category                      Pilot Trial Battery Test Names

Cognitive:

            Memory                        (Short) Memory Test - Computer
            Number                        Number Memory Test - Computer
            Perceptual Speed & Accuracy   Perceptual Speed & Accuracy - Computer
                                          Target Identification Test - Computer
            Induction                     Reasoning Test 1
                                          Reasoning Test 2
            Reaction Time                 Simple Reaction Time - Computer
                                          Choice Reaction Time - Computer
            Spatial Orientation           Orientation Test 1
                                          Orientation Test 2
                                          Orientation Test 3
            Spatial Visualization/Field   Shapes Test
              Independence
            Spatial Visualization         Object Rotations Test
                                          Assembling Objects Test
                                          Path Test
                                          Maze Test

Non-Cognitive, Biodata/Temperament:

    1       Adjustment                    ABLE (Assessment of Background
    2       Dependability                   and Life Experiences)
    3       Achievement
    4       Physical Condition
    5       Potency
    6       Locus of Control
    7       Agreeableness/Likeability
    8       Validity Scales

Non-Cognitive, Interests:

    1       Realistic                     AVOICE (Army Vocational
    2       Investigative                   Interest Career Examination)
    3       Conventional
    4       Social
    5       Artistic
    6       Enterprising

Psychomotor:

    1       Multilimb Coordination        Target Tracking Test 2 - Computer
                                          Target Shoot - Computer
    2       Precision                     Target Tracking Test 1 - Computer
    3       Manual Dexterity              (None)

*Final priority arrived at via consensus of March 1984 IPR attendants.

Figure II.4.  Predictor categories discussed at IPR in March 1984,
linked to Pilot Trial Battery test names.


Section 2


SUMMARY OF PILOT TEST PROCEDURES


The initial pilot testing of the predictor battery was carried out in
three different samples. Not all tests were administered to each sample, and
revisions were made in the instruments after each data collection. The basic
procedures are described below, in summary fashion, to help maintain clarity
for the reader as the results are discussed in later sections. The first
three administrations of the Pilot Trial Battery were at Fort Carson, Fort
Campbell, and Fort Lewis.

The tables in this section list measures that have not yet been
discussed in detail (i.e., the new tests designed to measure the constructs
identified in Section 1). The individual new tests will be fully described
as part of the discussion of the pilot test results in Sections 3-5. ABLE is
the new inventory developed to include temperament and biographical items.
AVOICE is an interest inventory which is a modification of the VOICE
(Vocational Interest Career Examination) originally developed by the Air
Force. The marker tests were the off-the-shelf instruments that had also
been included in the Preliminary Battery.


Pilot Test 1: Fort Carson


Sample and Procedure

On 17 April 1984, 43 soldiers at Fort Carson, Colorado, participated in
the first pilot testing of the Pilot Trial Battery. The testing session ran
from 0800 hours to 1700 hours, with two 15-minute breaks (one mid-morning and
one mid-afternoon), and a 1-hour break for lunch.

Groups of five soldiers at a time were randomly selected to take
computerized measures in a separate room while the remaining soldiers took
paper-and-pencil tests (new cognitive tests and selected marker tests). When
a group of five soldiers completed the computerized measures, they were
individually and collectively interviewed about their reactions to the
computerized tests, especially regarding clarity of instructions, face
validity of tests, sensitivity of items, and general disposition toward such
tests. The soldiers then returned to the paper-and-pencil testing session,
and another group of five was selected to take the computer measures.

Thus, the maximum N for any single paper-and-pencil test was 38 (43
minus 5). Computerized measures were administered to a total of 20
soldiers. The new paper-and-pencil cognitive tests in the Pilot Trial
Battery were each administered in two equally timed halves, to investigate
the Part 1/Part 2 correlations as estimates of test reliability.
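
The Part 1/Part 2 approach can be illustrated with a short sketch that
correlates examinees' scores on the two separately timed halves. The
function names are hypothetical. The Spearman-Brown step-up shown at the end
is a standard psychometric adjustment for estimating full-length reliability
from a half-test correlation; it is included as an assumption and is not
described in this section as part of the project's analysis.

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between Part 1 and Part 2 scores.

    Assumes both parts show some score variance; otherwise the
    denominator is zero and the correlation is undefined.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length score lists with n >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half: float) -> float:
    """Step a half-test correlation up to an estimate for the full-length test."""
    return 2 * r_half / (1 + r_half)
```

For example, a Part 1/Part 2 correlation of .50 would step up to an
estimated full-length reliability of about .67 under the Spearman-Brown
assumptions.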

Actual test administration was completed by approximately 1545 hours.
Ten soldiers were then selected to give specific, test-by-test feedback about
paper-and-pencil tests in a small group session, while the remaining soldiers
participated in a more general feedback and debriefing session.


Tests Administered


Table II.1 contains a list of all the tests administered at Fort Carson,
in the order in which they were administered, with the time limit and the
number of items for each test.


Pilot Test 2: Fort Campbell

Sample and Procedure

The second pilot testing session was conducted at Fort Campbell,
Kentucky, on 16 May 1984. Fifty-seven soldiers attended the 8-hour session,
and all 57 completed paper-and-pencil tests. No computerized measures were
administered at this pilot session. Once again, the 10 new cognitive tests
were administered in two equally timed halves, to investigate Part 1/Part 2
correlations. Because we were still experimenting with time limits on the
new cognitive tests, soldiers were asked to mark which item they were on when
time was called for each of these tests, and then continue to work on that
part of the test until they finished. Finishing times were recorded for all
the tests (Parts 1 and 2 separately, where appropriate).
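
One way such records could be summarized when setting time limits is
sketched below: the share of examinees who had reached the last item when
time was called, and the median finishing time. This is an illustrative
assumption about how the data might be used, not a description of the
analyses actually performed, and the function names are hypothetical.

```python
from statistics import median
from typing import Sequence


def completion_rate(item_reached: Sequence[int], n_items: int) -> float:
    """Fraction of examinees who had reached the last item when time was called."""
    if not item_reached:
        raise ValueError("no examinee records supplied")
    reached_end = sum(1 for item in item_reached if item >= n_items)
    return reached_end / len(item_reached)


def median_finish_minutes(finish_minutes: Sequence[float]) -> float:
    """Median time, in minutes, that examinees needed to finish a test part."""
    return median(finish_minutes)
```

A low completion rate together with a median finishing time well above the
limit would suggest that the test is functioning largely as a speed test
under that limit.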

Test administration was completed at approximately 1600 hours, and the
group was divided. Ten individuals were selected to provide specific
feedback concerning the new non-cognitive measures, and the remaining
individuals provided feedback on the new cognitive measures.


Tests Administered

Table II.2 lists all the tests and inventories administered at Pilot
Test 2 along with the time limit and number of items for each. There were 10
new cognitive tests with 5 cognitive marker tests, and 2 new non-cognitive
inventories with 1 non-cognitive marker inventory. No computerized measures
were administered.


Pilot  Test  3:  Fort  Lewis 

For the third pilot testing session, approximately 24 soldiers per day
for 5 days (11-15 June 1984) were available for testing at Fort Lewis,
Washington. A total of 118 soldiers participated. Their mean age and time
in the Army were 22.8 and 2.5 years, respectively. There were 97 men and 22
women, and 66 whites, 30 blacks, and 14 Hispanics. They were distributed
over a wide range of MOS. Test sessions ran from 0800 hours to 1700 hours
with short breaks in the morning and afternoon, and a 1-hour lunch break.
The entire Pilot Trial Battery, including new cognitive and non-cognitive
measures, was administered to all soldiers.

Once again, the new paper-and-pencil cognitive tests were administered
in two equally timed halves to investigate Part 1/Part 2 correlations as
estimates of test reliability. Individuals were not allowed any extra time
to work on each test beyond the time limits, but finishing times were
recorded for individuals completing tests before time was called.


Table II.1

Tests of Pilot Trial Battery Administered at Fort Carson (17 April 1984)


                                         Time
                                         Limit    No. of
Test                                     (Mins.)  Items   Type of Test

Paper-and-Pencil Tests

 1.  Path Test                              9       35    New, Cognitive
 2.  Reasoning Test 1                      14       30    New, Cognitive
 3.  EAS Test 1 - Verbal Comprehension      5       30    Marker, Cognitive
 4.  Orientation Test 1                     8       20    New, Cognitive
 5.  Shapes Test                           16       54    New, Cognitive
 6.  EAS Test 2 - Numerical Ability        10       75    Marker, Cognitive
 7.  Object Rotation Test                   7       60    New, Cognitive
 8.  ETS Choosing a Path                    8       16    Marker, Cognitive
 9.  Orientation Test 2                     8       20    New, Cognitive
10.  Reasoning Test 2                      11       32    New, Cognitive
11.  Orientation Test 3                    12       20    New, Cognitive
12.  Assembling Objects Test               16       30    New, Cognitive
13.  Maze Test                              9       24    New, Cognitive
14.  Mental Rotations Test                 10       20    Marker, Cognitive
15.  ETS Hidden Figures                    14       16    Marker, Cognitive
16.  ETS Map Planning                       6       40    Marker, Cognitive
17.  ETS Figure Classification              8       14    Marker, Cognitive
18.  EAS Test 5 - Space Visualization       5       50    Marker, Cognitive
19.  FIT Assembly                          10       20    Marker, Cognitive

Computer Measures(a)

 1.  Simple Reaction Time                 None      15    New, Perceptual/Psychomotor
 2.  Choice Reaction Time                 None      15    New, Perceptual/Psychomotor
 3.  Perceptual Speed & Accuracy          None      80    New, Perceptual/Psychomotor
 4.  Tracing Test                         None      26    New, Perceptual/Psychomotor
 5.  Short Memory Test                    None      50    New, Perceptual/Psychomotor
 6.  Hidden Figures Test                  None      32    New, Perceptual/Psychomotor
 7.  Target Shoot                         None      20    New, Perceptual/Psychomotor


(a) All computer measures were administered using a Compaq portable
microprocessor with a standard keyboard plus a commercially available
dual-axis joystick.


Table II.2

Pilot Tests Administered at Fort Campbell (16 May 1984)


                                         Time
                                         Limit    No. of
Paper-and-Pencil Tests                   (Mins.)  Items   Type of Test

 1.  Path Test                              9       44    New, Cognitive
 2.  Reasoning Test 1                      14       30    New, Cognitive
 3.  EAS Test 1 - Verbal Comprehension      5       30    Marker, Cognitive
 4.  Orientation Test 1                     9       30    New, Cognitive
 5.  Shapes Test                           16       54    New, Cognitive
 6.  Object Rotation Test                   9       90    New, Cognitive
 7.  Reasoning Test 2                      11       32    New, Cognitive
 8.  Orientation Test 2                     8       20    New, Cognitive
 9.  ABLE (Assessment of Background and   None     291    New, Non-Cognitive
       Life Experiences)
10.  Orientation Test 3                    12       20    New, Cognitive
11.  Assembling Objects Test               16       40    New, Cognitive
12.  Maze Test                              8       24    New, Cognitive
13.  AVOICE (Army Vocational Interest     None     306    New, Non-Cognitive
       Career Examination)
14.  ETS Hidden Figures                    14       16    Marker, Cognitive
15.  ETS Map Planning                       6       40    Marker, Cognitive
16.  ETS Figure Classification              8       14    Marker, Cognitive
17.  FIT Assembly                          10       20    Marker, Cognitive
18.  POI (Personal Opinion Inventory)     None     121    Marker, Non-Cognitive


After  each  soldier  completed  the  computer-administered  battery,  he  or 
she  was  asked  about  general  reactions  to  the  computerized  battery,  the 
clarity and completeness of the instructions, the perceived difficulty of the
tests,  and  the  ease  of  using  the  response  apparatus. 




Tests  Administered 

The tests administered at Pilot Test 3, at Fort Lewis, are listed in
Table II.3 with the time limit and number of items in each test.


Table II.3

Pilot Tests Administered at Fort Lewis (11-15 June 1984)


                                             Time
Administration                               Limit    No. of
Group           Test                         (Mins.)  Items   Type of Test

Paper-and-Pencil Tests

C1              Path Test                       8       44    New, Cognitive
                Reasoning Test 1               12       30    New, Cognitive
                Orientation Test 1             10       30    New, Cognitive
                Shapes Test                    16       54    New, Cognitive
                Object Rotation Test            8       90    New, Cognitive
                Reasoning Test 2               10       32    New, Cognitive
                Maze Test                       6       24    New, Cognitive
                SRA Word Grouping               5       30    Marker, Cognitive
                Orientation Test 2             10       24    New, Cognitive

C2              Orientation Test 3             12       20    New, Cognitive
                Assembling Objects Test        16       40    New, Cognitive
                ETS Map Planning               16       40    Marker, Cognitive
                Mental Rotations Test          10       20    Marker, Cognitive
                DAT Abstract Reasoning         13       25    Marker, Cognitive

NC              ABLE                          None     268    New, Non-Cognitive
                AVOICE                        None     306    New, Non-Cognitive

Computerized Measures(a)

                Simple Reaction Time          None      15    New, Perceptual/Psychomotor
                Choice Reaction Time          None      15    New, Perceptual/Psychomotor
                Perceptual Speed & Accuracy   None      80    New, Perceptual/Psychomotor
                Target Tracking Test 1        None      18    New, Perceptual/Psychomotor
                Target Tracking Test 2        None      18    New, Perceptual/Psychomotor
                Target Identification Test    None      44    New, Perceptual/Psychomotor
                Memory Test                   None      50    New, Perceptual/Psychomotor
                Target (Shoot) Test           None      40    New, Perceptual/Psychomotor

(a) All computer measures were administered via a custom-made response
pedestal designed specifically for this purpose. No responses were made on
the computer keyboard. A Compaq microprocessor was used.


Summary of Pilot Tests

The Pilot Trial Battery, initially developed in March 1984, went through
three pilot testing iterations by August 1984. After each iteration,
observations noted during administration were scrutinized, data analyses
were conducted, and the results were carefully examined. Revisions were made
in specific item content, test length, and time limits, where appropriate.
Table II.4 summarizes the three pilot test sessions conducted during this
period, with the total sample size for each, and the number and types of
tests administered at each.


Table II.4

Summary of Pilot Testing Sessions for Pilot Trial Battery


Pilot                                  Total
Test                                   Sample
No.    Location        Date            Size     Number/Type of Tests

1      Fort Carson     17 April 1984     43     10  New Cognitive
                                                 9  Marker Cognitive
                                                 0  New Non-Cognitive
                                                 0  Marker Non-Cognitive
                                                 7  Computerized Measures

2      Fort Campbell   16 May 1984       57     10  New Cognitive
                                                 5  Marker Cognitive
                                                 2  New Non-Cognitive
                                                 1  Marker Non-Cognitive
                                                 0  Computerized Measures

3      Fort Lewis      11-15 June 1984  118     10  New Cognitive
                                                 4  Marker Cognitive
                                                 2  New Non-Cognitive
                                                 0  Marker Non-Cognitive
                                                 8  Computerized Measures


The following sections in Part II contain discussions of each test,
inventory, and measure in the Pilot Trial Battery, its evolution through the
pilot testing process, and its status as of the end of August 1984.


Section  3 


DEVELOPMENT OF COGNITIVE PAPER-AND-PENCIL MEASURES


This section describes the development of the paper-and-pencil cognitive
predictor measures, up to the point at which they were ready for field
testing as part of the Pilot Trial Battery. As described previously,
cognitive ability constructs had been evaluated and prioritized according to
their judged relevance and importance for predicting success in a variety of
Army MOS. These priority judgments were used to plan the development
activities for cognitive paper-and-pencil tests.

Each cognitive predictor category is discussed in turn. Within each
category are a definition of the target cognitive ability and an outline of
the strategy followed to develop the measure(s) of the target ability. This
includes identifying (a) the target population or target MOS for which the
measure is hypothesized to most effectively predict success; (b) published
tests that served as markers for each new measure; (c) intended level of item
difficulty; and (d) type of test (i.e., speed, power, or a combination). The
test itself is then described and example items are provided. Results from
the first two pilot test administrations or tryouts are reported, to explain
and document subsequent test revisions. Finally, psychometric test data
obtained from the third pilot test, conducted at Fort Lewis, are discussed.

The last portion of this section presents a summary of the newly developed
cognitive ability tests. This includes a discussion of test
intercorrelations, results from a factor analysis of the intercorrelations,
and results from subgroup analyses of test scores.


General  Issues 


Before describing the individual tests, we would like to summarize
certain general issues germane to all the cognitive paper-and-pencil
measures.

Target  Population 

The population for which these tests have been developed is the same one
to which the Army administers the ASVAB, that is, persons applying to enlist
in the Army. However, that target population was, practically speaking,
inaccessible during the development process. We were constrained to the use
of incumbents. Enlisted soldiers represent a restricted sample of the target
population in that they all have passed enlistment standards and,
furthermore, almost all of the soldiers that we were able to use in our pilot
tests had also passed Basic and Advanced Individual Training. Thus, they are
presumably more qualified, more able, more persevering, and so forth, on the
average, than are the individuals in the target population. We tried to take
this into account by (a) developing tests that have a broad range of item
difficulties and (b) selecting items of somewhat lower difficulty.
Another decision to be made about each test was its placement on the
power vs. speed continuum. Most psychometricians would agree that a "pure"
power test is a test administered in such a way that each person is allowed
enough time to attempt all items, and that a "pure" speeded test is a test
administered in such a way that no one taking the test has enough time to
attempt all of the items. In practice, most tests fall somewhere between the
two extremes. It also is the case that a power test usually contains items
that not all persons could answer correctly, even given unlimited time to
complete the test, while a speeded test usually contains items that all or
almost all persons could answer correctly, given enough time to attempt the
items.

As a matter of practical definition, an "80% completion" rule-of-thumb
was used to define a power test. That is, if a test could be completed by
80% of all those taking the test, then we considered it a "power" test.
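The rule of thumb can be illustrated with a minimal Python sketch. It assumes that "completed" means an examinee attempted every item and that a completion rate of exactly 80% counts as meeting the rule; the function name and the sample counts are hypothetical and are not taken from the pilot data.

```python
def classify_test(items_attempted, n_items, threshold=0.80):
    """Label a test "power" or "speeded" using a completion-rate rule.

    items_attempted: number of items each examinee attempted (hypothetical data).
    n_items: number of items on the test.
    threshold: proportion of examinees who must finish for a "power" label.
    """
    if not items_attempted:
        raise ValueError("need at least one examinee")
    # An examinee "completes" the test by attempting every item (assumption).
    finished = sum(1 for k in items_attempted if k >= n_items)
    completion_rate = finished / len(items_attempted)
    label = "power" if completion_rate >= threshold else "speeded"
    return label, completion_rate


# Four of five examinees finish a 40-item test: treated as a power test.
print(classify_test([40, 40, 40, 40, 38], n_items=40))
```

Under this reading, a test that only half of the examinees finish would be labeled speeded.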


Reliability 

Several procedures are available to assess the reliability of a measure,
and each provides distinct information about a test. Split-half reliability
estimates were obtained for each paper-and-pencil test administered at the
first three pilot test sites. For each pilot test, each test was administered
in two separately timed parts. Reliability estimates were obtained by
correlating scores from the two parts. The Spearman-Brown correction
procedure was then used to estimate the reliability for the whole test. This
estimate of reliability is appropriate for either speeded or power tests.
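The split-half procedure can be sketched in Python as follows. The Spearman-Brown formula for a test lengthened by a factor n is n*r / (1 + (n - 1)*r), which for two halves reduces to 2r / (1 + r). The function names are illustrative only; the check values use part correlations reported later in this section (.65 for Assembling Objects, .70 for the Path Test).

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_part, factor=2):
    """Estimated reliability of a test `factor` times as long as one part."""
    return factor * r_part / (1 + (factor - 1) * r_part)


# Part 1 / Part 2 correlations reported in this section.
print(round(spearman_brown(0.65), 2))  # Assembling Objects
print(round(spearman_brown(0.70), 2))  # Path Test
```

In practice, `pearson_r` would be applied to examinees' Part 1 and Part 2 scores and the result passed to `spearman_brown`.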

Hoyt internal consistency reliability estimates are also reported for
each test, providing the average reliability across all possible split-test
halves. This procedure is less appropriate for speeded tests because it
overestimates the reliability.
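For readers unfamiliar with the Hoyt estimate, the following Python sketch shows the usual analysis-of-variance form, 1 - MS(residual) / MS(persons), computed from a persons-by-items score matrix. This form is algebraically equivalent to coefficient alpha (KR-20 for right/wrong items). The function name and the tiny data set are hypothetical; the report does not describe the software actually used.

```python
def hoyt_reliability(scores):
    """Hoyt's ANOVA-based internal consistency estimate.

    scores: list of rows, one per examinee, each a list of item scores
    (for example 1 = correct, 0 = incorrect).
    """
    n = len(scores)       # examinees
    k = len(scores[0])    # items
    grand = sum(sum(row) for row in scores) / (n * k)
    person_means = [sum(row) / k for row in scores]
    item_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_persons = k * sum((m - grand) ** 2 for m in person_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    ss_residual = ss_total - ss_persons - ss_items

    ms_persons = ss_persons / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return 1 - ms_residual / ms_persons


# Hypothetical responses from four examinees on three items.
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(hoyt_reliability(data))
```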


Individual  Test  Descriptions 

We turn now to the descriptions of the individual tests, which are
discussed within cognitive ability constructs. This description is given in
some detail because these are new measures that are of fundamental importance
for the basic goals of Project A. As mentioned above, a standard format is
used to describe the development of each instrument. Readers who are not
interested in the specifics of predictor content may wish to turn to the
summary sections.


Construct  -  Spatial  Visualization 

Spatial visualization involves the ability to mentally manipulate
components of two- or three-dimensional figures into other arrangements. The
process involves restructuring the components of an object and accurately
discerning their appropriate appearance in new configurations. This
construct includes several subcomponents, two of which are:

•  Rotation - the ability to identify a two-dimensional figure when
seen at different angular orientations within the picture plane.
It also includes three-dimensional rotation or the ability to
identify a three-dimensional object projected on a two-dimensional
plane, when seen at different angular orientations either within
the picture plane or about the axis in depth.

•  Scanning - the ability to visually survey a complex field to find a
particular configuration representing a pathway through the field.

Currently, no ASVAB measures are designed specifically to measure
spatial abilities. Because of this, spatial visualization received a
developmental priority rating of one (see Figure II.4). The visualization
construct was divided into two parts: visualization/rotation and
visualization/scanning. We developed two tests within each of these areas;
these four tests are described below.


Spatial  Visualization  -  Rotation 

The two tests developed for this ability are Assembling Objects and
Object Rotation. The former involves three-dimensional figures, while the
latter involves two-dimensional objects.

Assembling  Objects  Test 

Development Strategy. Predictive validity estimates provided by expert
raters suggest that measures of the visualization/rotation construct would be
effective predictors of success in MOS that involve mechanical operations and
construction and drawing or using maps. The Assembling Objects Test was
designed to yield information about the potential for success in such MOS.

Published tests identified as markers for Assembling Objects include the
Employee Aptitude Survey Space Visualization (EAS-5) and the Flanagan
Industrial Test (FIT) Assembly. EAS-5 requires examinees to count
three-dimensional objects depicted in two-dimensional space, whereas the FIT
Assembly involves mentally piecing together objects that are cut apart or
disassembled. The FIT Assembly was selected as the more appropriate marker
for our purposes because it has both visualization and rotation components
involving mechanical or construction activities. The Assembling Objects Test
was designed to assess the ability to visualize how an object will look when
its parts are put together correctly. It was intended that this measure
would combine power and speed components, with speed receiving greater
emphasis.

Test Description. In the original form of the Assembling Objects Test,
subjects were asked to complete 30 items within a 16-minute time limit. Each
item presents subjects with components or parts of an object. The task is to
select, from among four alternatives, the one object that depicts the
components or parts put together correctly. Two item types are included in
the test; examples of each are shown in Figure II.5.
Figure II.5. Sample items from Assembling Objects Test.


Results from the first tryout, conducted at Fort Carson, indicated that
the test may have suffered from ceiling effects. That is, nearly all
recruits in this sample (N = 36) completed the test and their mean score was
24.2 (SD = 5.05). Further, item difficulty levels (proportion answering
correctly) were somewhat higher than intended (mean = .80, SD = .12,
median = .83); that is, the items were easier than intended.

Therefore, 10 new, more difficult items, five for each item type, were
constructed and added to the test to reduce the likelihood of ceiling
effects. The 16-minute time limit was retained for the second tryout, at
Fort Campbell. Nearly all subjects completed the test (mean number of items
completed = 37.3, SD = 4.75) and the mean score was 26.3 (SD = 8.34, N = 56).
Item difficulty levels were lower for the revised test (mean = .68, SD = .15,
median = .72). Inspection of these results indicated that the test possessed
acceptable psychometric qualities, so no changes were made in preparation for
the Fort Lewis pilot test.

Test Characteristics. At Fort Lewis the Assembling Objects Test
contained 40 items with a 16-minute time limit. The mean number of items
completed, standard deviation, and range were 37.6, 3.83, and 18 to 40,
respectively. Corresponding values for number correct (or test score) were
28.1, 7.51, and 7 to 40. Item difficulties range from .31 to .92 with a mean
of .70 (SD = 14.7). Item-total correlations range from .18 to .60 with a
mean of .44 (SD = 9.99). Parts 1 and 2 correlate .65 with each other.
Reliabilities are estimated at .79 by split-half methods (Spearman-Brown
corrected), and .89 with Hoyt's estimate of reliability.
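The item statistics reported for each test can be illustrated with a short Python sketch. It assumes that item difficulty is the proportion of examinees answering the item correctly (consistent with the values reported here) and that the item-total correlation is computed against the uncorrected total score, which includes the item itself; the report does not state which form was used. The function name and the response matrix are hypothetical.

```python
from math import sqrt


def item_statistics(responses):
    """Return (difficulty, item_total_r) for each item.

    responses: list of rows, one per examinee, with 1 = correct, 0 = incorrect.
    item_total_r is None when the item or the total score has no variance.
    """
    n = len(responses)
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    ss_total = sum((t - mean_total) ** 2 for t in totals)

    stats = []
    for j in range(k):
        item = [row[j] for row in responses]
        difficulty = sum(item) / n  # proportion correct
        ss_item = sum((x - difficulty) ** 2 for x in item)
        if ss_item == 0 or ss_total == 0:
            r = None
        else:
            sxy = sum((x - difficulty) * (t - mean_total)
                      for x, t in zip(item, totals))
            r = sxy / sqrt(ss_item * ss_total)
        stats.append((difficulty, r))
    return stats


# Hypothetical responses from four examinees on three items.
for difficulty, r in item_statistics([[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]):
    print(difficulty, r)
```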

Correlations between scores on this measure and scores on other Pilot
Trial Battery paper-pencil measures are reported at the end of this section.
It is important, however, to note the correlations between this test and
marker tests. Both marker tests were administered in the Fort Carson tryout
and the FIT Assembly was also used at Fort Campbell. Results from Fort
Carson indicate that scores on Assembling Objects correlate .74 with scores
on EAS-5 and .76 with scores on FIT Assembly (N = 30). Results from Fort
Campbell indicate that this test correlates .04 with FIT Assembly (N = 54).
This last value represents a better estimate of the relationship between
Assembling Objects and its marker, FIT Assembly, because of the revisions
made to Assembling Objects following the first tryout at Fort Carson.

Modifications for the Fort Knox Field Test. In preparation for the Fort
Knox administration, some Assembling Objects items were redrawn to clarify
the figures. The item response format was modified to approximate a format
suitable for machine scoring, a change that was made in all of the tests
being prepared for field test administration.

Object  Rotation  Test 

Development Strategy. Published tests serving as markers for the Object
Rotation measure include Educational Testing Service's (ETS) Card Rotations,
Thurstone's Flags Test, and Shepard-Metzler Mental Rotations. Each of these
measures requires the subject to compare a test object with a standard object
to determine whether the two represent the same figure with one simply turned
or rotated or whether the two represent different figures. The first two
measures, ETS Card Rotations and Thurstone's Flags, involve visualizing
two-dimensional rotation of an object, whereas the Mental Rotations test
requires visualizing three-dimensional objects depicted in two-dimensional
space.

Object Rotation Test items were constructed to reflect a limited range
of item difficulty levels ranging from very easy to moderately easy, and
designed to be easier than those in the Assembling Objects Test. The new
test had more items and a shorter time limit than the Assembling Objects
Test.


Test Description. The initial version contained 60 items with a
7-minute time limit. The subject's task involved examining a test object and
determining whether the figure represented in each item is the same as the
test object, only rotated, or is not the same as the test object (e.g.,
flipped over). For each test object there are five test items, each
requiring a response of "same" or "not same." Sample test items are shown in
Figure II.6.

Results from the Fort Carson administration indicated that this test
suffered from ceiling effects. For example, item difficulty levels averaged
.92 (SD = .05). Therefore, we decided to add 30 new items to the test and to
increase the time limit to 9 minutes for the second tryout at Fort Campbell.

Results from the second tryout indicated that subjects, on the average,
completed 87.6 (SD = 8.0) of the 90 items and obtained a mean score of 77.0
(SD = 12.1). The time limit was reduced to 8 minutes for the Fort Lewis
administration to obtain a more highly speeded test.
Figure II.6. Sample items from Object Rotation Test.


Test Characteristics. Detailed results from the Fort Lewis pilot test
showed fairly high completion rates (mean = 84.6 and SD = 10.8), with a range
of 48 to 90. Test scores, computed by the total number correct, ranged from
36 to 90 with a mean of 73.4 (SD = 15.4). Item difficulty levels range from
.59 to .93 with a mean of .81 (SD = .11). Item-total correlations (item
validities) average .44 (SD = .17), ranging from .09 to .79. Parts 1 and 2
correlate .73 with each other. The split-half reliability estimate,
corrected for test length, is .86 while the Hoyt estimate is .96.

The marker test for Object Rotation, Mental Rotations, was administered
at two of the three pilot test sites. Data collected at the Fort Carson
tryout indicate that the two measures correlate .60 (N = 30), whereas data
from Fort Lewis indicate the two correlate .56 (N = 118).


Modifications for the Fort Knox Field Test. Results from the Fort Lewis
pilot test indicated that all test items possessed desirable psychometric
properties. However, the time limit was decreased to 7.5 minutes to make the
test even more speeded and avoid a possible ceiling effect. The response
format was modified to approximate a format suitable for machine scoring.


Spatial  Visualization  -  Scanning 

The second component of spatial visualization ability that was
emphasized in predictor development is spatial scanning. Spatial scanning
tasks require the subject to visually survey a complex field and find a
pathway through it, utilizing a particular configuration. The Path Test and
the Maze Test were developed to measure this component of spatial
visualization.

Path  Test 


Development Strategy. Published tests serving as markers for
construction of the Path Test include ETS Map Planning and ETS Choosing a
Path. In these measures, examinees are provided with a map or diagram. The
task is to follow a given set of rules or directions to proceed through the
pathway or to locate an object on the map.

Results from earlier research with the marker tests, ETS Map Planning
and ETS Choosing a Path, indicated that both tests are highly speeded and
were very difficult for the target sample (Hough et al., 1984). For example,
80% of the subjects (Army recruits) completed only 16 of the 40 items
contained in the Map Planning Test, and only 5 of the 16 items in the
Choosing a Path Test. Consequently, Path Test items were constructed to
yield difficulty levels for the target population ranging from very easy to
somewhat difficult and the test time was established to place more emphasis
on speed than on power.

Test Description. The Path Test requires subjects to determine the best
path or route between two points. Subjects are presented with a map of
airline routes or flight paths. Figure II.7 contains a flight path with four
sample items. The subject's task is to find the "best" path, that is, the
path between two points that requires the fewest stops. Each lettered dot
is a city that counts as one stop; the beginning and ending cities (dots) do
not count as stops.
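The scoring key for an item of this kind can be derived with a breadth-first search: the fewest stops between two cities equals the number of legs on the shortest route minus one. The Python sketch below is illustrative only. The route map is hypothetical (the map in Figure II.7 is not reproduced here), and routes are assumed to be usable in both directions.

```python
from collections import deque


def fewest_stops(routes, start, goal):
    """Minimum number of intermediate cities between start and goal.

    routes: iterable of (city_a, city_b) pairs, flown in either direction.
    Returns None if the goal cannot be reached.
    """
    if start == goal:
        return 0
    graph = {}
    for a, b in routes:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    queue = deque([(start, 0)])  # (city, legs flown to reach it)
    seen = {start}
    while queue:
        city, legs = queue.popleft()
        for nxt in graph.get(city, ()):
            if nxt == goal:
                # legs + 1 legs in total, so `legs` intermediate stops.
                return legs
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, legs + 1))
    return None


# Hypothetical flight map, not the one shown in Figure II.7.
routes = [("A", "B"), ("B", "C"), ("C", "F"), ("A", "D"), ("D", "F"), ("G", "E")]
print(fewest_stops(routes, "A", "F"))  # one stop, via D
print(fewest_stops(routes, "G", "E"))  # direct route, no stops
```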

In its original form, the Path Test contained 35 items with a 9-minute
time limit. Subjects were asked to record the number of stops for each item
in a corresponding blank space.

Results from the first tryout, conducted at Fort Carson, revealed that
the test was too easy. That is, virtually all of the subjects completed the
test and they obtained a mean score of 29.9. Item difficulty levels ranged
from .48 to 1.00 with a mean of .85. To reduce the potential for ceiling
effects, an additional map or flight path with 13 items was added to the
test. In addition, four very easy items were deleted, resulting in 44 items
on the revised test. The 9-minute limit was retained.


The route from:

1. A to F
2. G to E
3. C to D
4. C to F

Figure II.7. Sample items from Path Test.


Results from the second tryout indicate that, on the average, subjects
completed 40.7 items and obtained a mean score of 32.6 (SD = 7.0). Item
difficulty levels ranged from .55 to .96 with a mean of .80. To prepare for
the third tryout, conducted at Fort Lewis, the test response format was
revised to allow subjects to circle the number of stops (i.e., 1-5) instead
of filling in a blank. In addition, the time limit was reduced from 9 minutes
to 8 minutes to increase the speededness of the test.

Test Characteristics. In results from the Fort Lewis tryout of the
revised Path Test, subjects, on the average, completed 35.3 of the 44 items
(SD = 8.3). Test scores, computed by the total number correct, ranged from 0
to 44 with a mean of 28.3 (SD = 9.1). Item difficulty levels range from .20
to .91 with a mean of .64. Item-total correlations average .47 with a range
of .25 to .69. Parts 1 and 2 correlate .70. The split-half reliability
estimate, corrected for test length, is .82. The Hoyt internal consistency
value is .92.

One or both marker tests were administered at all pilot test sites.
Data from the first tryout indicate that the original Path Test correlates
.34 with ETS Choosing a Path and -.01 with ETS Map Planning. The reader is
reminded that results from Fort Carson are based on a very small sample size
(N = 19) and that the Path Test was greatly modified following this tryout.

The ETS Map Planning Test was also administered at the Fort Campbell and
Fort Lewis tryouts. These data indicate that the Path Test and Map Planning
correlate .62 (N = 54) and .48 (N = 118), respectively.

Modifications for the Fort Knox Field Test. The Path Test remained
unchanged except that the response format was modified to approximate a
format suitable for machine scoring.


Maze  Test 


Development  Strategy.  The  Maze  Test  is  the  second  measure  constructed 
to  assess  spatial  visualization/scanning.  The  development  strategy  mirrors 
that  of  the  Path  Test,  with  the  same  marker  tests.  The  Maze  Test,  however, 
differs  from  the  Path  Test  in  that  the  task  required  involves  finding  the  one 
pathway  that  allows  exit  from  the  maze,  while  the  Path  Test  was  designed  to 
measure  visualization/scanning  ability  under  highly  speeded  conditions. 

Test Description. For the first pilot test administration the Maze Test
contained 24 rectangular mazes. Each included four entrance points, labeled
A, B, C, and D, and three exit points indicated by an asterisk (*). The task
is to determine which of the four entrances leads to a pathway through the
maze and to one of the exit points. A 9-minute limit was established.

Results from the first tryout, at Fort Carson, indicate that the Maze
Test suffered from ceiling effects. Subjects, on the average, completed 23.3
of the 24 items and obtained a mean score of 22.1 (SD = 2.18). To increase
test score variance, the test was modified in two ways. First, an additional
exit was added to each test maze. Figure II.8 contains a sample item from
the original test and the same item modified for the Fort Campbell tryout.
Second, the time limit was reduced from 9 to 8 minutes.

Data obtained at the second tryout, conducted at Fort Campbell, indicate
that completion rates were again high (mean = 22.5). Therefore, for the
third tryout the time limit for completing the 24 maze items was reduced to 6
minutes.

Test Characteristics. Results from the Fort Lewis tryout indicate that
the reduced time produced a drop in the completion rate for the Fort Lewis
sample (mean = 20.6). Test scores, computed by the total number correct,
ranged from 8 to 24 with a mean of 19.3 (SD = 4.4). Item difficulty levels
range from .41 to .98 with a mean of .81. Item-total correlations average
.48 (SD = .22). Parts 1 and 2 correlate .64 with each other. The split-half
reliability estimate corrected for test length is .78 and the Hoyt
reliability estimate for this test is .80.

One or both of the marker tests, ETS Choosing a Path and ETS Map
Planning, were administered at the three pilot test sites. Results from Fort
Carson indicate that the Maze Test correlates .24 (N = 29) with Choosing a
Path, and .36 (N = 30) with Map Planning. These values must be viewed with
caution because of the small sample size and because of modifications made to
the Maze Test following this tryout.

Map  Planning  was  also  administered  at  the  Fort  Campbell  and  Fort  Lewis 
tryouts.  Data  collected  at  these  posts  indicate  that  Map  Planning  correlates 
.45  (N  =  55)  and  .6?  (N  =  118),  respectively,  with  the  Maze  Test. 

Figure II.8. Sample items from Maze Test.

Modifications for the Fort Knox Field Test. Results from the last pilot
test administration showed that the Maze Test could be slightly more
speeded. The percentage of subjects completing this test is higher than for
the Path Test (i.e., 38% for the Maze Test, and 19% for the Path).
Therefore, the time limit was reduced from 6 minutes to 5.5 minutes for the
Fort Knox field test. In addition, the response format was modified to
approximate that for machine scoring.


Construct  -  Field  Independence 


This construct involves the ability to find a simple form when it is
hidden in a complex pattern. Given a visual percept or configuration, field
independence refers to the ability to hold the percept or configuration in
mind so as to disembed it from other well-defined perceptual material.
This construct received a mean validity estimate of .30 from the panel
of expert judges, with the highest estimate of .37 appearing for MOS that
involve detecting and identifying targets. Field independence received a
priority rating of two for inclusion in the battery. One instrument, the
Shapes Test, was developed to measure this construct.

Shapes  Test 

Development Strategy. The marker test for the Shapes Test is the ETS
Hidden Figures Test, a measure included in the Preliminary Battery (Hough et
al., 1984). In this test, subjects are asked to find one of five simple
figures located in a more complex pattern. Initial analyses of the
Preliminary Battery results indicated that for the target population,
first-term enlisted soldiers, the Hidden Figures Test suffers from limited
test score variance and possibly floor effects. For example, the initial
data indicate that 80% of the sample completed fewer than 4 of the 16 test
items.

Our strategy for constructing the Shapes Test was to use a task similar
to that in the Hidden Figures Test while ensuring that the difficulty level
of test items was geared more toward the Project A target population.
Further, we decided to include more types of items that reflect varying
difficulty levels. We wanted the test to be speeded, but not nearly so much
as the ETS Hidden Figures Test.

Test Description. At the top of each test page are five simple shapes;
below these shapes are six complex figures. Subjects are instructed to
examine the simple shapes and then to find the one simple shape located in
each complex figure (see Figure II.9).

In the first tryout at Fort Carson, the Shapes Test contained 54 items
with a 16-minute time limit. Results from this first tryout indicated that
most subjects were able to complete the entire test and most subjects
obtained very high scores (mean score = 49.3).

To prepare for the Fort Campbell tryout, nearly all test items were
modified to increase item difficulty levels. Examples of item modifications
are provided in Figure II.9. As is shown, by adding a few lines to each
complex pattern, the test items administered at Fort Campbell were made more
difficult than the items administered at the Fort Carson tryout.

Results  from  Fort  Campbell  indicate  that  test  item  modifications  were 
successful.  Subjects,  on  the  average,  completed  43.5  of  the  54  items  within 
the  16-minute  time  limit,  and  obtained  a  mean  score  of  30.9  (SD  =  23.5). 

This  test  was  modified  only  slightly  for  the  Fort  Lewis  administration. 
For  example,  a  few  complex  figures  were  revised  to  ensure  that  one  and  only 
one  simple  figure  could  be  located  in  each  complex  figure. 

Test Characteristics. For the Fort Lewis sample the mean number
completed was 42.4. The mean number correct was 29.3 (SD = 9.1) with a range
of 12 to 51, indicating that the measure does not suffer from ceiling
effects. Item difficulty levels range from .10 to .97 with a mean of .54.
Reliability estimates indicate that Parts 1 and 2 correlate .69; the
Spearman-Brown corrected split-half estimate is .82, and the Hoyt
reliability estimate is .89.
[Figure II.9 panels: Fort Carson; Fort Campbell]


The marker test, ETS Hidden Figures Test, was administered at the first
two tryouts. Results from the Fort Carson tryout indicate that the original
version of the Shapes Test correlates .35 with the Hidden Figures Test
(N = 29). Data from Fort Campbell indicate that the revised Shapes Test
correlates .50 with its marker (N = 56).

Modifications for the Fort Knox Field Test. The Shapes Test required
only minor revisions for this final tryout. For example, item-total
correlations for a few items indicated that more than one shape could be
located in a complex figure test item, so these figures were modified. In
addition, the response format was changed to approximate that for machine
scoring.


Construct  -  Spatial  Orientation 

This construct involves the ability to maintain one's bearings with
respect to points on a compass and to maintain appreciation of one's location
relative to landmarks in the environment.

This particular construct was not included in the list of predictor
constructs evaluated by the expert panel. However, conceptualization and
measurement of this ability construct first appeared during World War II,
when researchers for the Army Air Forces (AAF) Aviation Psychology Program
explored a variety of constructs to aid in selecting air crew personnel.
Results from the AAF Program indicated that measures of spatial orientation
were useful in selecting pilots and navigators (Guilford & Lacey, 1947).
Also, during the second year of Project A, a number of job observations
suggested that some MOS involve critical job requirements of maintaining
directional orientation and establishing location, using features or
landmarks in the environment. Consequently, three different measures of this
construct were formulated.


Orientation  Test  1 

Development Strategy. Direction Orientation Form B (CP515B) developed
by researchers in the AAF Aviation Psychology Program served as the marker
for Orientation Test 1. The strategy for developing Orientation Test 1
involved generating items that duplicated the task in the AAF test. Each
item contains six circles. The first, the standard compass or "Given"
circle, indicates the direction of North. This is usually rotated out of the
typical or conventional position. The remaining circles are test compasses
that also have directions marked on them. For this test, item construction
was limited to one of seven possible directions: South, East, West,
Southwest, Northwest, Southeast, and Northeast. The plan for this test was
to ask subjects to complete numerous compass directional items within a short
period of time. Orientation Test 1 was designed as a highly speeded test of
spatial orientation.

Test Description. Each test item presented subjects with six circles.
In the test's original form, the first, or Given, circle indicated the
compass direction for North. For most items, North was rotated out of its
conventional position. Compass directions also appeared on the remaining
five circles. The subject's task was to determine, for each circle, whether
or not the direction indicated was correctly positioned by comparing it to
the direction of North in the Given circle. (See Example 1 in Figure II.10.)
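The judgment required for each test circle can be illustrated with a short
sketch. It is not taken from the test materials: it assumes, purely for
illustration, that the position of a marked direction is encoded as an angle
in degrees clockwise from the top of the circle, and the function names are
hypothetical.

```python
# Compass bearings in degrees clockwise from North (standard convention).
BEARINGS = {
    "N": 0, "NE": 45, "E": 90, "SE": 135,
    "S": 180, "SW": 225, "W": 270, "NW": 315,
}


def is_correctly_positioned(given_dir, given_angle, test_dir, test_angle):
    """Return True if test_dir, drawn at test_angle, agrees with the Given circle.

    Angles are page positions in degrees clockwise from the top of the
    circle (an assumed encoding; the printed test shows them graphically).
    """
    # Page angle at which North must lie, inferred from the Given circle.
    north_angle = (given_angle - BEARINGS[given_dir]) % 360
    # Page angle at which the test direction should therefore appear.
    expected = (north_angle + BEARINGS[test_dir]) % 360
    return expected == test_angle % 360


def score_item_set(given_dir, given_angle, circles, responses):
    """Count the responses that match the true status of each test circle.

    circles   -- list of (direction, page_angle) for the test compasses
    responses -- list of booleans, True meaning "judged correctly positioned"
    """
    truth = [
        is_correctly_positioned(given_dir, given_angle, d, a) for d, a in circles
    ]
    return sum(1 for t, r in zip(truth, responses) if t == r)
```

With North drawn at the 90-degree position (pointing to the right of the
page), East belongs at 180 degrees and West at 0 degrees, so a subject who
judges four of five test circles correctly earns 4 of the 5 possible points
for that item set.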

When administered to the Fort Carson sample, this test contained 20 item
sets requiring 100 responses (i.e., for every item, compass directions on
five circles must be evaluated). Subjects were given 8 minutes to complete
the test. Test scores were determined by the total number correct; the
maximum possible was 100.

Results from this administration indicated that nearly all subjects
completed the items within the time allotted and the mean score was 82.7
(SD = 17.8). Item difficulty levels indicate that most items were moderately
easy (mean = .83). For the Fort Campbell tryouts, we attempted to create
more difficult items by modifying directional information provided in the
Given circle. That is, rather than indicating the direction for North,
compass directions for South, East, and West were provided. These directions
were also rotated out of conventional compass position. (See Example 2,
Figure II.10.) Orientation Test 1, as presented at the Fort Campbell tryout,
contained 30 item sets or 150 items, administered in three separately timed
parts. Parts 1 and 2 included the original test items, and Part 3 the new
(non-North) items. Part 3 was preceded by additional test instructions
informing subjects about the change in Given circle directions. Subjects
were given 3 minutes to complete each part.

Figure II.10. Sample items from Orientation Test 1.

In this second tryout, scores on Part 3 yielded lower correlations with
Parts 1 and 2 (both equal .44); Parts 1 and 2 correlated .87. From this
information we reasoned that the new items were assessing additional
information about subjects' abilities to maintain orientation. Item sets
from Part 3 were then mixed with item sets from Parts 1 and 2 to create a
test with 30 item sets (150 items) for the Fort Lewis tryout. Further, the
time limit was increased to a total of 10 minutes. Test instructions were
modified to explain that items vary throughout the test with respect to
information provided in the Given circle.

Test Characteristics. At the Fort Lewis tryout, completion rates for the
total test indicated that, on the average, examinees attempted 25 of the 30
item sets and obtained a mean score of 117.8 (SD = 24.1). Item difficulty
levels range from .21 to .97 with a mean of .79. The correlation between
Parts 1 and 2 is .85. Reliability estimates are .92 for split-half
Spearman-Brown corrected and .97 for Hoyt.
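The step from the Part 1-Part 2 correlation to the corrected split-half
estimate follows the standard Spearman-Brown formula. The sketch below is
only an illustration of that arithmetic; the function name is hypothetical.

```python
def spearman_brown(r_parts, length_factor=2.0):
    """Project a part-part correlation to a test `length_factor` times as long.

    With length_factor=2 this is the usual split-half correction:
    r_full = 2r / (1 + r).
    """
    return length_factor * r_parts / (1.0 + (length_factor - 1.0) * r_parts)


# Orientation Test 1 at Fort Lewis: Parts 1 and 2 correlate .85.
print(round(spearman_brown(0.85), 2))  # 0.92
```

The same calculation reproduces the other corrected values reported for the
Fort Lewis tryout, for example .80 to .89 and .79 to .88.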

Modifications for the Fort Knox Field Test. Very few changes were
made. The response format was modified to approximate a format scorable by
machine. Orientation Test 1 contains 30 item sets (150 items) with a
10-minute time limit.


Orientation  Test  2 

Development Strategy. The second measure of spatial orientation was
also designed to tap abilities that might predict success for MOS that
involve maintaining appreciation of one's location relative to landmarks in
the environment or in spite of frequent changes in direction. Orientation
Test 2 is a relatively new approach to assessing spatial orientation
abilities. Although no particular test served as its model, it is similar to
a measure designed by Army Air Forces researchers to select pilots,
navigators, and bombardiers (Directional Orientation: CP5150).

The task designed for Orientation Test 2 asks subjects to mentally
rotate objects and then to visualize how components or parts of those objects
will appear after the object is rotated. Item difficulty levels were varied
by altering the degree of rotation required to correctly complete each part
of the task. Because of the complexity of the task, Orientation Test 2 was
initially viewed as a power test of spatial orientation.

Test Description. For Orientation Test 2, each item contains a picture
within a circular or rectangular frame. The bottom of the frame has a circle
with a dot inside it. The picture or scene is not in an upright position.
The task, then, is to mentally rotate the frame so that the bottom of the
frame is positioned at the bottom of the picture. After doing so, one must
then determine where the dot will appear in the circle. (See Figure II.11
for sample items.) For the Fort Carson tryout, this test contained 20 items
with an 8-minute time limit.

Results from this administration indicate that the time limit was
sufficient (mean number completed = 19.9), but that item difficulty levels
were somewhat lower than desired (mean = .52). Item-total correlations were,
however, impressive (mean = .48, SD = .10). The only potential problem
involved the test instructions, which some respondents found difficult. For
the Fort Campbell tryout, test instructions were modified to clarify the
nature of the task.

Figure II.11. Sample items from Orientation Test 2.
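For readers unfamiliar with the item statistics cited throughout this
section, the following sketch shows one common way to compute them from a
matrix of scored (0/1) responses. It assumes that item difficulty is the
proportion of examinees answering the item correctly (so higher values mean
easier items) and that the item-total correlation uses the uncorrected total
score; the report does not state which variant was used, and the function
names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def item_statistics(responses):
    """Return (difficulty, item_total_r) for each item.

    responses -- list of examinee rows, each a list of 0/1 item scores
    """
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        difficulty = sum(item) / len(item)  # proportion answering correctly
        stats.append((difficulty, pearson(item, totals)))
    return stats


# Four hypothetical examinees, three items.
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
for difficulty, r_it in item_statistics(data):
    print(round(difficulty, 2), round(r_it, 2))
```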

Data collected at Fort Campbell provide information similar to the Fort
Carson data; however, the mean score and item difficulty levels indicated
that the test was more difficult for this group than for the Fort Carson
sample (mean score = 8.6; mean item difficulty = .43). Because of these
lower item-difficulty levels, four new, easier items were added.

Orientation Test 2, as administered to the Fort Lewis sample, contained
24 items, and a 10-minute time limit was established to correspond to the
increase in the number of items. Test scores on this measure are determined
by the total number correct.

Test Characteristics. The Fort Lewis data indicate that Orientation
Test 2 is a power test (mean number completed = 23.7, SD = 1.0). Subjects
obtained a mean score of 11.5 (SD = 6.2). Item difficulty levels range from
.19 to .71 with a mean of .48; this represents a slight increase from the
Fort Campbell tryout. Item-total correlations remain high (mean of .53).
Scores from Parts 1 and 2 correlate .80. Correcting this value for test
length yields a split-half reliability estimate of .89. The Hoyt internal
consistency value is also .89.
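The Hoyt value is an analysis-of-variance estimate of internal consistency.
As a minimal sketch, assuming a complete persons-by-items matrix of item
scores with no missing responses, it can be computed as one minus the ratio
of the residual mean square to the between-persons mean square; this is
algebraically equivalent to coefficient alpha. The function name and the data
are hypothetical.

```python
def hoyt_reliability(scores):
    """Hoyt ANOVA reliability for a persons-by-items score matrix.

    scores -- list of examinee rows, each a list of item scores
    """
    n = len(scores)      # number of persons
    k = len(scores[0])   # number of items
    grand_total = sum(sum(row) for row in scores)
    correction = grand_total ** 2 / (n * k)

    ss_total = sum(x * x for row in scores for x in row) - correction
    ss_persons = sum(sum(row) ** 2 for row in scores) / k - correction
    ss_items = (
        sum(sum(row[j] for row in scores) ** 2 for j in range(k)) / n
        - correction
    )
    ss_residual = ss_total - ss_persons - ss_items

    ms_persons = ss_persons / (n - 1)
    ms_residual = ss_residual / ((n - 1) * (k - 1))
    return 1.0 - ms_residual / ms_persons


# Four hypothetical examinees, three items.
data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(hoyt_reliability(data), 2))  # 0.75
```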

Modifications  for  the  Fort  Knox  Field  Test.  This  measure  was  virtually 
unchanged  for  the  Fort  Knox  administration.  Only  the  response  format  was 
modified  to  approximate  machine  scoring. 


Orientation  Test  3 


Development Strategy. This test was also designed to measure spatial
orientation and was modeled after another spatial orientation test, Compass
Directions, developed by researchers in the AAF Aviation Psychology Program.
The AAF measure was designed to assess the ability to reorient oneself to a
particular ground pattern quickly and accurately when compass directions are
shifted about. Orientation Test 3 was designed to assess the same ability
using a similar test format. This test was designed to place somewhat more
emphasis on speed than on power.

Test Description. In its original form, Orientation Test 3 presented
subjects with a map that includes various landmarks such as a barracks, a
campsite, a forest, and a lake. Within each item, subjects are provided with
compass directions by indicating the direction from one landmark to another,
such as "the forest is North of the campsite." Subjects are also informed of
their present location relative to another landmark. Given this information,
the subject must determine which direction to go to reach yet another
structure or landmark. Figure II.12 contains one test map and two sample
items. Note that for each item, new or different compass directions are
given.
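The reasoning an examinee must carry out for each item can be sketched as
follows. The landmark coordinates and the example statement are invented for
illustration and do not come from the test map; the sketch also assumes that
the stated direction is exact and that answers are limited to the eight
principal compass points.

```python
import math

COMPASS = ["N", "NE", "E", "SE", "S", "SW", "W", "NW"]


def page_bearing(origin, target):
    """Angle of target as seen from origin, in degrees clockwise from page top."""
    dx = target[0] - origin[0]
    dy = target[1] - origin[1]
    return math.degrees(math.atan2(dx, dy)) % 360


def direction_to_travel(landmarks, stated, current, goal):
    """Compass direction from `current` to `goal` under the item's compass.

    landmarks -- dict of name -> (x, y) page coordinates, y increasing upward
    stated    -- (a, direction, b), read as "a is <direction> of b"
    """
    a, direction, b = stated
    # Page angle that counts as North for this item.
    north = (page_bearing(landmarks[b], landmarks[a])
             - 45 * COMPASS.index(direction)) % 360
    relative = (page_bearing(landmarks[current], landmarks[goal]) - north) % 360
    return COMPASS[round(relative / 45) % 8]


# Hypothetical map: the lake sits directly above the farm on the page.
landmarks = {"farm": (4, 0), "lake": (4, 4), "depot": (0, 0), "bridge": (0, -3)}
# "The lake is West of the farm. You are at the depot. Which way to the bridge?"
print(direction_to_travel(landmarks, ("lake", "W", "farm"), "depot", "bridge"))  # E
```

Because each item restates the compass, the same map position can correspond
to a different answer from one item to the next.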

For the Fort Carson tryout, the test contained two maps with 10
questions about each map, for a total of 20 items. Subjects were given 12
minutes to complete the test. Results from this first tryout revealed very
few problems with the test (e.g., test instructions were clear, the time was
sufficient, no floor or ceiling effects appeared). Thus, this measure
remained unchanged for the Fort Campbell pilot test.

Results from the second tryout yielded similar information (e.g., no
ceiling or floor effects, completion rates were acceptable). These data,
however, indicated that for a few items, two responses could be correct
because of a lack of precision in drawing the two maps. Accordingly,
landmarks on each map were repositioned to ensure that one and only one
correct answer existed for each item. When administered to the Fort Lewis
sample, Orientation Test 3 contained 20 test items with a 12-minute time
limit. Test scores are determined by the total number correct.

1. The forest is due [illegible] of the barracks. You are at headquarters.
   Which direction must you travel in order to reach the firing range?

   1. N  2. NE  3. E  4. SE  5. S  6. SW  7. W  8. NW

2. The firing range is southwest of the hospital. You are at the farm.
   Which direction must you travel in order to reach the campsite?

   1. N  2. NE  3. E  4. SE  5. S  6. SW  7. W  8. NW

Figure II.12. Sample items from Orientation Test 3.

Test Characteristics. On the average, subjects at Fort Lewis completed
18.0 items (SD = 2.7). The mean score of 8.7 (SD = 5.8) indicates that
subjects correctly answered about one-half of the items attempted. Item
difficulty levels range from .24 to .63 with a mean of .43. Item-total
correlations range from .48 to .72 with a mean of .59 (SD = .07). Part 1 and
Part 2 correlate .79. The split-half reliability estimate corrected for test
length is .88, while the Hoyt internal consistency estimate is .90.


Data from Fort Carson indicate that Orientation Test 3 correlates .66
with Orientation Test 1 (N = 29) and .42 with Orientation Test 2 (N = 31).
Values for these same measures administered at Fort Campbell are .72 and .54
(N = 56). Data from Fort Lewis indicate that these measures correlate .68
and .65 (N = 118).

Modifications for the Fort Knox Field Test. This test was virtually
unchanged for the Fort Knox field test; the only exception involves changes
to approximate a machine-scorable response format.


Construct - Induction/Figural Reasoning

This construct involves the ability to generate hypotheses about
principles governing relationships among several objects.

Example measures of induction include the Employee Aptitude Survey
Numerical Reasoning (EAS-6), Educational Testing Service (ETS) Figure
Classification, Differential Aptitude Test (DAT) Abstract Reasoning, Science
Research Associates (SRA) Word Grouping, and Raven's Progressive Matrices.
These paper-and-pencil measures present subjects with a series of objects
such as figures, numbers, or words. To complete the task, subjects must
first determine the rule governing the relationship among the objects and
then apply the rule to identify the next object in the series. Two different
measures of the construct were developed for Project A.


Reasoning  Test  1 

Development Strategy. The plan for developing Reasoning Test 1 was to
construct a test that was similar to the task appearing in EAS-6, Numerical
Reasoning, but with one major difference: Items would be composed of figures
rather than numbers. Test items were constructed to represent varying
degrees of difficulty ranging from very easy to very difficult. Following
item development activities, time limits were established to allow sufficient
time for subjects to complete all or nearly all items. Thus, Reasoning Test
1 was designed as a power measure of induction.


Test Description. Reasoning Test 1 items present subjects with a series
of four figures; the task is to identify from among five possible answers the
one figure that should appear next in the series. In the original test,
subjects were asked to complete 30 items in 14 minutes. Sample items are
provided in Figure II.13.

Results from the first tryout, conducted at Fort Carson, indicate that
subjects, on the average, completed 29.5 items and obtained a mean score of
20.7 (SD = 3.5). Inspection of difficulty levels indicated that items were
unevenly distributed between the two test parts, so items were reordered to
ensure that easy and difficult items were equally distributed throughout both
test parts. Only minor modifications were made to test items; for example,
one particularly difficult item was redrawn to reduce the difficulty level.


Figure II.13. Sample items from Reasoning Test 1.


Data collected at Fort Campbell indicate that again nearly all subjects
completed the test. Further, test administrators reported that those who
completed the test finished early. As a consequence, the 14-minute time
limit was reduced to 12 minutes. Also, two items were revised because
distractors yielded higher item-total correlations than the correct response.

Test Characteristics. Fort Lewis subjects, on the average, completed
29.4 items with about 34% of the subjects completing the entire test. Test
scores, computed from the total number correct, ranged from 4 to 29 with a
mean of 19.6. Item difficulty levels range from .24 to .92 with a mean of
.66. Part 1 and Part 2 correlate .64. The split-half reliability estimate
corrected for test length is .78, while the Hoyt value is .86.

Two other measures of induction, SRA Word Grouping and DAT Abstract
Reasoning, were administered at the Fort Lewis tryout. Results indicate that
Reasoning Test 1 correlates .47 with Word Grouping and .74 with Abstract
Reasoning. These data are compatible with our understanding of the two
marker measures of induction. The former contains a verbal component while
the latter measures induction via figural reasoning.

Modifications for the Fort Knox Field Test. Test instructions were
revised slightly for the Fort Knox field test, and the response format was
modified to approximate that used for machine scoring.


Reasoning  Test  2 

Development Strategy. This measure also was designed to assess
induction using items that require figural reasoning.

Published tests serving as markers for Reasoning Test 2 include EAS-6,
Numerical Reasoning, and ETS Figure Classification. The original development
strategy was to develop Reasoning Test 2 fairly similarly to the ETS test.
Initial analyses conducted on ETS Figure Classification data (N = 1,863)
indicated that this test was too highly speeded for the target population
(Hough et al., 1984). For example, 80% of recruits taking the Figure
Classification test finished fewer than half of the 112 items. Further,
although item difficulty levels varied greatly, the mean value indicated most
items are moderately easy (mean = .73, SD = .22).

Therefore, although the ETS Figure Classification test served as the
marker in early test development planning for Reasoning Test 2, the new
measure differed in several ways. First, ETS Figure Classification requires
subjects to perform two tasks: to identify similarities and differences among
groups of figures and then to classify test figures into those groups. Items
in Reasoning Test 2 were designed to involve only the first task. Second,
test items were constructed to reflect a wide range of difficulty levels,
with the average item falling in the moderately difficult range. Finally,
because the items would be more difficult overall, Test 2 could contain fewer
items. The test was thus designed as a power measure of figural reasoning,
with a broad range of item difficulties.

Test Description. Test items present five figures. Subjects are asked
to determine which four figures are similar in some way, thereby identifying
the one figure that differs from the others. (See Figure II.14.) This test,
when first administered, contained 32 items with an 11-minute time limit.

Results from the Fort Carson tryout indicated that nearly all subjects
completed the entire test. Item difficulty levels were somewhat higher than
expected, ranging from .05 to 1.0 with a mean of .71 (SD = .29). Because
eight of the test items yielded item difficulty levels of .97 or above, these
items were either modified or replaced to increase item difficulties.
Moreover, inspection of item difficulties indicated that Part 1 contained a
greater proportion of the easier items, so items were redistributed
throughout the test.

For the Fort Campbell tryout, Reasoning Test 2 again contained 32 items
with an 11-minute time limit. Analysis of the data indicated desirable
psychometric qualities. For example, nearly all subjects completed the
test. Test scores ranged from 9 to 26 with a mean of 19.1 (SD = 3.5) and
difficulty levels decreased. Although the part-part correlation increased
from the first tryout, it still remained low (i.e., Fort Campbell r = .40
versus Fort Carson r = .32).

A  few  changes  were  made  in  the  test  prior  to  the  third  tryout.  For 
example,  four  items  contained  a  distractor  that  was  selected  more  often  and 
that  yielded  a  higher  item-total  correlation  than  the  correct  response;  these 
items  were  revised.  Test  administrators  at  Fort  Campbell  noted  that  the  time 
limit  could  be  reduced  without  significantly  altering  test  completion  rates, 
so  the  limit  was  reduced  to  10  minutes  for  the  next  administration. 
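The distractor check described above can be sketched in a few lines. The
example assumes that each examinee's chosen option is recorded for the item
and that each option is correlated with total test score as a 0/1 indicator;
the flagging rule shown (any distractor whose correlation exceeds that of the
keyed answer) is one plausible reading of the criterion, and all names and
data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if undefined)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def option_statistics(choices, totals):
    """Return {option: (proportion choosing it, correlation with total score)}."""
    stats = {}
    for option in sorted(set(choices)):
        picked = [1 if c == option else 0 for c in choices]
        stats[option] = (sum(picked) / len(picked), pearson(picked, totals))
    return stats


def flag_distractors(choices, totals, key):
    """List distractors that correlate more highly with total score than the key."""
    stats = option_statistics(choices, totals)
    key_r = stats[key][1]
    return [opt for opt, (_, r) in stats.items() if opt != key and r > key_r]


# Hypothetical item keyed "A": the highest scorers are drawn to option "B".
choices = ["A", "A", "B", "B", "C", "C"]
totals = [5, 6, 9, 10, 1, 2]
print(flag_distractors(choices, totals, key="A"))  # ['B']
```

The selection proportions returned by `option_statistics` allow the second
criterion mentioned in the text, a distractor chosen more often than the
keyed response, to be inspected alongside the correlations.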


Figure II.14. Sample items from Reasoning Test 2.


Test Characteristics. In the third tryout, 70% completed the entire test
(however, [percentage illegible] completed the separately timed first half
and 79% completed the second half). Scores ranged from 11 to 28 with a mean
of 21.8 (SD = 3.4). Item difficulties range from .17 to 1.0 with a mean of
.64. Parts 1 and 2 correlate .46. The split-half reliability estimate
corrected for test length is .63 while the Hoyt value is .61. These values
suggest that this is a more heterogeneous test of figural reasoning than is
Reasoning Test 1.


The marker test, ETS Figure Classification, was administered at the
first two tryouts. Correlations between Reasoning Test 2 and its marker are
.35 (N = 30 at Fort Carson) and .23 (N = 56 at Fort Campbell). These low
correlations are not too surprising, given the task requirement differences
and the power versus speed component differences between these two measures.
Two other measures of induction, SRA Word Grouping and DAT Abstract
Reasoning, were administered at the third tryout. These data indicate that
Reasoning Test 2 correlates .48 with Word Grouping and .66 with Abstract
Reasoning (N = 118). Once again, these differences in correlations are as
expected, since Word Grouping contains a verbal component, whereas Abstract
Reasoning, like Reasoning Test 2, assesses induction using figural items.


Summary of Pilot Test Results for
Cognitive Paper-and-Pencil Measures


In this section, we summarize the data available as of August 1984 for
the 10 cognitive paper-and-pencil measures. This includes test score
information, intercorrelations among the 10 measures, and results from factor
analyses. The bulk of the data reported here was obtained from the Fort
Lewis tryout. Table II.5 summarizes the Fort Lewis data. All data are based
on a sample size of 118, with the exception of the Path Test, which is based
on a sample size of 116.


Table II.5

Cognitive Paper-and-Pencil Measures: Summary of Fort Lewis Pilot
Test Results

                                                     Mean Item-
                          No. of    Mean             Difficulty   Split-Half a
Measure                   Items     Score     SD     Level        rxx

SPATIAL VISUALIZATION
  Rotation
    Assembling Objects      40      28.14     7.51     .70          .79
    Object Rotation         90      73.36    15.40     .82          .86
  Scanning
    Path                    44      28.28     9.08     .64          .82
    Mazes                   24      19.30     4.35     .80          .78

FIELD INDEPENDENCE
    Shapes                  54      29.28     9.14     .54          .8?

SPATIAL ORIENTATION
    Orientation 1          150     117.86    24.16     .79          .92
    Orientation 2           24      11.53     6.20     .43          .89
    Orientation 3           20       8.71     5.78     .44          .88

REASONING
    Reasoning 1             30      19.64     5.75     .66          .78
    Reasoning 2             32      21.82     3.38     .64          .63

a All reliability estimates (split halves with Part 1-Part 2 separately
timed) have been corrected with the Spearman-Brown procedure.


Table II.6 contains the intercorrelation matrix for the 10 cognitive
ability measures. One of the most obvious features of this matrix is the
high level of correlation across all measures. The correlations across all
test pairs range from .40 to .68. These data suggest that the test measures
overlap in the abilities assessed.

This finding is not altogether surprising. For example, 4 of the 10
measures were designed to measure spatial abilities such as visualization,
rotation, and scanning. The Shapes Test, designed to measure field
independence, also includes visualization components. The three tests
constructed to measure spatial orientation involve visualization and rotation
tasks. The final two measures, Reasoning Test 1 and Reasoning Test 2, also
require visualization at some level to identify the principle governing
relationships among figures and to determine the similarities and differences
among figures. Thus, across all measures, abilities needed to complete the
required tasks overlap to some degree. This overlap is demonstrated in the
intercorrelation matrix.

To provide a better understanding of the underlying structure, the
intercorrelation matrix was factor analyzed. Several solutions were
computed, ranging from two to five factors. The rotated orthogonal solution
for four factors appeared most meaningful. Results from this solution appear
in Table II.7. As shown in the table, to interpret results from the
four-factor solution, we first identified all factor loadings of .35 or
higher. Next, we examined the factor loading pattern for each measure and
then identified measures with similar patterns to form test clusters. Five
test clusters or groups, labeled A through E, are identified in Table II.7.
These clusters represent a first attempt to identify the underlying structure
of the cognitive measures included in the Pilot Trial Battery. Each test
cluster is described below.
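The two mechanical steps in this interpretation, marking loadings of .35 or
higher and computing each measure's communality from an orthogonal solution,
can be sketched as follows. The loading values and test names below are
invented for illustration and are not the Table II.7 values; grouping
measures by identical salient-loading patterns is also a simplification of
the judgment-based clustering described in the text.

```python
def communality(loadings):
    """Sum of squared loadings across factors (orthogonal solution)."""
    return sum(value * value for value in loadings)


def salient_pattern(loadings, cutoff=0.35):
    """Tuple marking which factors load at or above the cutoff."""
    return tuple(abs(value) >= cutoff for value in loadings)


def cluster_by_pattern(table, cutoff=0.35):
    """Group measures that share the same pattern of salient loadings."""
    clusters = {}
    for name, loadings in table.items():
        clusters.setdefault(salient_pattern(loadings, cutoff), []).append(name)
    return list(clusters.values())


# Hypothetical four-factor loadings for three tests.
table = {
    "Test P": [0.70, 0.10, 0.20, 0.05],
    "Test Q": [0.65, 0.20, 0.10, 0.15],
    "Test R": [0.10, 0.60, 0.40, 0.00],
}
print(round(communality(table["Test P"]), 4))  # 0.5425
print(cluster_by_pattern(table))  # [['Test P', 'Test Q'], ['Test R']]
```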

Group A - Assembling Objects and Shapes Test. Recall that the Shapes
Test requires the subject to locate or disembed simple forms from more
complex patterns, while the Assembling Objects Test requires the subject to
visualize how an object will appear when its components are put together.
Both measures require subjects to visualize objects or forms in new or
different configurations. Further, these measures contain both power and
speed components, with each falling more toward the speed end of the
continuum.

Group B - Object Rotation, Path, and Maze Tests. Object Rotation
involves two-dimensional rotation, while the Path and Maze tests involve
visually scanning a map or diagram to identify the best pathway or the one
pathway that leads to an exit. These measures are all highly speeded; that
is, subjects are required to perform the tasks at a fairly rapid rate.
Further, the tasks involved in each of these measures appear less complex or
easier than those involved in the Assembling Objects or Shapes tests.

Group C - Orientation Test 1 and Orientation Test 3. Orientation Test 1
requires the examinee to compare compass directions provided on a test circle
and a given circle, while Orientation Test 3 involves using a map, compass
directions, and present location to determine which direction to go to reach
a landmark on the map. Both measures require examinees to quickly and
accurately orient themselves with respect to directions on a compass and
landmarks in the environment, despite shifts or changes in the directions.
Both are highly speeded measures of spatial orientation.


Table II.6

Intercorrelations Among the 10 Cognitive Paper-and-Pencil Measures:
Pilot Test Data a

a Correlations are computed from a sample size of 118 except those involving
the Path Test.


Table II.7

Rotated Orthogonal Factor Solution for Four Factors on Cognitive
Paper-and-Pencil Measures: Pilot Test Data a

                                Loadings of .35 or higher
Cluster   Measure               (left to right across Factors I-IV)   h2 b

A         Shapes                .47   .49                             .568
A         Assembling Objects    .47   .48                             .621
B         Object Rotation       .50   .37                             .473
B         Path                  .55   .40                             .541
B         Mazes                 .76                                   .727
C         Orientation 1         .39   .57                             .617
C         Orientation 3         .79   .35                             .827
D         Orientation 2         .35   .74                             .684
E         Reasoning 1           .39   .35   .67                       .778
E         Reasoning 2           .37   .36   .44                       .521

a Factor loadings of .35 or higher are shown.

b Proportion of total test score variance in common with other tests,
or common variance.

Group D - Orientation Test 2. This measure involves mentally rotating a
frame so that it corresponds to or matches up with the picture inside, and
then visualizing how components on the frame (a circle with a dot) will
appear after it has been rotated. This appears to be a very complex spatial
measure that requires several abilities such as visualization, rotation, and
orientation. In addition to the task complexity differences, this measure
may also differ from other spatial measures on the power-speed continuum.
Unlike the other spatial measures included in the Pilot Trial Battery,
Orientation Test 2 is a power rather than a speed test.

Group E - Reasoning Test 1 and Reasoning Test 2. Reasoning Test 1
requires one to identify the principle governing the relationship or pattern
among several figures, while Reasoning Test 2 involves identifying
similarities among several figures to isolate the one figure that differs
from the others. As noted above, these measures appear to involve
visualization abilities. The reasoning task involved in each, however,
distinguishes these measures from the other tests included in the Pilot Trial
Battery.

Results from analyses of the Fort Lewis data provide a preliminary
structure for the cognitive paper-and-pencil tests designed for the Pilot
Trial Battery. Correlations among the measures indicate that all measures
require spatial visualization abilities at some level. The measures may,
however, be distinguished by the type of task, task complexity, and speed and
power component differences.

In this section we have focused on the cognitive paper-and-pencil
measures. Other cognitive measures in the Pilot Trial Battery were
administered via computer and are described in the following section.
Correlations among the cognitive paper-and-pencil tests and the cognitive
computer-administered tests are also reported in that section.
Administration and results of the field tests are reported in Section 6.


Section 4


DEVELOPMENT OF COMPUTER-ADMINISTERED TESTS


In this section the development steps and pilot test results for the
computer-administered measures are described. Before discussing the tests
themselves, we will briefly describe a critical piece of equipment designed
especially for pilot administrations of the computerized tests in the Pilot
Trial Battery.

The microprocessor selected for use, the COMPAQ, contains a standard
keyboard, and in early tryouts of the computer battery subjects were asked to
make their responses on this keyboard. These preliminary trials suggested
that the use of a keyboard may provide an unfair advantage to subjects with
typing or data entry experience, and that the standard keyboard did not
provide adequate experimental control during the testing process.
Consequently, a separate response pedestal was designed and built and was
ready for use in the final pilot test at Fort Lewis.

This response pedestal is depicted in Figure II.15. Note that it contains
two joysticks (one for left-handed subjects and one for right-handed
subjects), two sliding resistors, a dial for entering demographic data such
as age and social security number, two red buttons, three response buttons
(blue, yellow, and white), and four green "home" buttons. (One of the "home"
buttons is not visible in the diagram; it is located on the side of the
pedestal.)

The "home" buttons play a key role in capturing subjects' reaction time
scores. They control the onset of each test item or trial when reaction time
is being measured. To begin a trial, the subject must place his or her hands
on the four green buttons. After the stimulus appears on the screen and the
subject has determined the correct response, he or she must remove his or her
preferred hand from the "home" buttons and press the correct response button.

The procedure involving the "home" buttons serves two purposes. First,
control is added over the location of the subjects' hands while the stimulus
item is presented. In this way, hand movement distance is the same for all
subjects and variation in reaction time due to position of subjects' hands is
reduced to nearly zero.

Second, procedures involving these buttons are designed to assess two
theoretically important components of reaction time measures: decision time
and movement time. Decision time includes the period between stimulus onset
and the point at which the subject removes his or her hands to make a
response; this interval reflects the time required to process the information
to determine the correct response. Movement time involves the period between
removing one's hands from the "home" buttons and striking a response key.
The "home" buttons on the response pedestal, then, are designed to
investigate the two theoretically independent components of reaction time.
Results from an investigation of these measures appear throughout this
section.
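
The split between decision time and movement time can be sketched in a few
lines of Python. This is an illustration only, not the software used in the
pilot tests; the record layout and the names Trial and split_reaction_time
are assumptions. Given three timestamps for a trial (stimulus onset, release
of the "home" buttons, and the response key press), it returns decision,
movement, and total time in hundredths of a second, the unit used in the
tables of this section.

```python
from dataclasses import dataclass


@dataclass
class Trial:
    """Timestamps for one trial, in seconds (hypothetical record layout)."""
    stimulus_onset: float   # stimulus appears on the screen
    home_release: float     # preferred hand leaves the "home" buttons
    response_press: float   # response button is struck


def split_reaction_time(trial: Trial) -> tuple[float, float, float]:
    """Return (decision, movement, total) time in hundredths of a second."""
    decision = (trial.home_release - trial.stimulus_onset) * 100.0
    movement = (trial.response_press - trial.home_release) * 100.0
    return decision, movement, decision + movement
```

Total time is simply the sum of the two components, so any two of the three
scores determine the third.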


Figure II.15. Response pedestal for computerized tests.


On the following pages, we describe the development and pilot testing of
computer-administered tests designed to measure three cognitive ability
constructs and two psychomotor ability constructs. Tests were also developed
to measure two additional constructs but were not pilot tested.


Construct  -  Reaction  Time  (Processing  Efficiency) 

This construct involves speed of reaction to stimuli; that is, the speed
with which a person perceives the stimulus independent of any time taken by
the motor response component of the classic reaction time measures.
According to our definition of this construct, which is an indicator of
processing efficiency, it includes both simple and choice reaction time.


Simple Reaction Time: Reaction Time Test 1


The  basic  paradigm  for  this  task  stems  from  Jensen's  research  involving 
the  relationship  between  reaction  time  and  mental  ability  (Jensen,  1982). 


At the computer console, the subject is instructed to place his or her
hands on the green "home" buttons. On the computer screen, a small box
appears. After a delay period (ranging from 1.5 to 3.0 seconds) the word
"yellow" appears in the box. The subject must remove the preferred hand from
the "home" buttons to strike the yellow key. The subject must then return
both hands to the ready position to receive the next item.


This test contains 15 items. Although it is self-paced, subjects are
given 10 seconds to respond before the computer "times out"¹ and prepares to
present the next item.


Test Characteristics. Table II.8 contains data on the test
characteristics from the Fort Lewis pilot test. Variables in the upper part
of the table provide descriptive information about test performance. Note
that, on the average, subjects read the test instructions in 2.5 minutes,
although this time ranges from about half a minute to 5 minutes. Further,
subjects completed the test in 1.2 minutes on average; this ranged from .8
minute to over 5 minutes. Total test time, then, ranged from 1.6 to 7.1
minutes with a mean of 3.7 minutes. Very few subjects timed out or provided
invalid responses; the maximum number of time-outs for any subject was three,
and the maximum number of invalid responses was one. Finally, percent-correct
values indicate nearly all subjects understood the task and performed it
correctly.


¹Time-outs occur if a subject fails to respond within a specified period of
time. Invalid responses occur when the subject strikes the wrong key. In
both cases, the item disappears from the computer screen and, after the
subject gets in the ready position, the next item appears on the screen.


Table II.8

Reaction Time Test 1 (Simple Reaction Time):
Fort Lewis Pilot Test


Descriptive Characteristics               Mean      SD          Range

Time to Read Instructions (minutes)       2.51     .81      .63 -   5.01
Time to Complete Test (minutes)           1.22     .62      .79 -   5.19
Total Test Time (minutes)                 3.72     .99     1.59 -   7.10
Time-Outs (number per person)              .05     .31        0 -   3
Invalid Responses (number per person)      .07     .26        0 -   1
Percent Correct                            99%      3%       80 - 100%


Test Scoresᵃ                              Mean      SD          Range        Rxxᵇ

Decision Time (10 Items)                 30.50   10.15    17.90 - 109.78     .91
Trimmedᶜ Decision Time (8 Items)         29.25    8.10    18.75 -  82.00     .92
SD - Decision                             7.85   12.05      .92 - 118.26     .77
Movement Time (10 Items)                 27.35    8.98    15.50 -  91.33     .75
Trimmed Movement Time (8 Items)          26.01    7.26    15.50 -  55.86     .94
SD - Movement                             6.68   12.77      .75 - 121.07     .20
Total Time (10 Items)                    57.84   15.78    37.90 - 149.56     .90
Trimmed Total Time (8 Items)             55.92   13.86    37.75 - 124.71     .94
SD - Total                               11.79   16.80     1.58 - 125.85     .66

ᵃ All values reported are in hundredths of a second.

ᵇ Rxx = odd-even correlations, corrected to full test length using the
Spearman-Brown formula.

ᶜ Trimmed scores are based on responses to Items 6-15, excluding the highest
and lowest scores.


Test Scores. To identify variables of interest, we reviewed the
literature in this area. (See Keyes, 1985.) This review indicated that
reaction time is often calculated for decision time, movement time, and total
time. In addition, intra-individual variation measures (the standard
deviation of total reaction time scores) calculated for each subject appear
to provide useful information. Considering problems related to practice
effects, only the last 10 responses were included in the mean reaction
scores. Further, because subtle events may produce extreme reaction scores
for a single item, trimmed scores, which include responses to Items 6 through
15 with the highest and lowest reaction time values removed, were used for
decision, movement, and total time.
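
The scoring rules in the preceding paragraph can be sketched as follows. The
function name and the dictionary keys are hypothetical, and two points are
assumptions not stated in the text: every item is taken to have a valid
response (time-outs and invalid responses are not handled), and the
intra-individual standard deviation is computed as a sample standard
deviation over the last 10 items.

```python
from statistics import mean, stdev


def simple_rt_scores(times):
    """Score one subject's 15 reaction times (hundredths of a second).

    `times` is in item order; items 1-5 are dropped to limit practice effects.
    """
    if len(times) != 15:
        raise ValueError("expected one reaction time per item (15 items)")
    scored = list(times[5:])            # Items 6 through 15
    trimmed = sorted(scored)[1:-1]      # drop the single lowest and highest value
    return {
        "mean": mean(scored),           # 10-item mean
        "trimmed_mean": mean(trimmed),  # 8-item trimmed mean
        "sd": stdev(scored),            # assumed: sample SD over the 10 items
    }
```

The same routine would be applied separately to the decision, movement, and
total times of each subject.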

Mean values for all the above measures were calculated. They appear in
Table II.8 along with reliability estimates for each measure, computed using
an odd-even method with a Spearman-Brown correction.
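
For readers unfamiliar with the odd-even method, a minimal sketch follows.
It assumes each subject's item scores are available as a list in item order;
the function names are illustrative, and the report does not describe how
the half-test scores were formed beyond "odd-even", so averaging the items in
each half is an assumption. The half-test correlation r is stepped up to full
test length with the Spearman-Brown formula, 2r / (1 + r).

```python
from math import sqrt
from statistics import mean


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def odd_even_reliability(item_scores):
    """item_scores: one list of item scores per subject, in item order."""
    odd_half = [mean(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_half = [mean(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(odd_half, even_half)
    return 2.0 * r_half / (1.0 + r_half)                  # Spearman-Brown step-up
```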


The relationships among these measures of reaction time were examined by
computing all pairwise correlations. These results indicate that a low to
moderate relationship exists between movement time and decision time (r = .32
for 10 items). Movement time appears to be providing kinds of information
similar to total time (r = .77 for 10 items). Decision time, however,
provides additional information (r = .50 for 10 items).

Correlations calculated with paper-and-pencil cognitive measures
indicate that decision time, total standard deviation, and percentage correct
are virtually unrelated to scores on these paper-and-pencil measures. Total
reaction time, however, correlates highest with the Maze Test (-.39), the
Path Test (-.23), and Orientation Test 1 (-.23). The detailed information on
intercorrelations between the computer-administered tests and the cognitive
paper-and-pencil tests is provided in the final portion of Section 4.

Finally, scores on these measures were correlated with video
experience. Prior to completing the computer tests, subjects had been asked
to rate, on a 5-point scale, their degree of experience* with video game
playing. Mean decision trimmed and mean total trimmed times correlate near
zero with this variable. Total standard deviation correlates .19 and percent
correct correlates -.20 with this measure.

Modifications  for  Fort  Knox  Field  Test.  This  test  remained  the  same  for 
the  Fort  Knox  field  test. 


Choice  Reaction  Time:  Reaction  Time  Test  2 

Reaction time for two response alternatives is obtained by presenting
the term BLUE or WHITE on the screen. The subject is instructed, when one of
these appears, to move his or her preferred hand from the "home" keys to
strike the key that corresponds with the term appearing on the screen (BLUE
or WHITE).


*A rating of 1 indicated no experience with video games; 5 indicated much
experience.

This measure contains 15 items, with 7 requiring responses on the WHITE
key and 8 requiring responses on the BLUE key. Although the test is
self-paced, the computer is programmed to allow 9 seconds for a response
before going on to the next item.

Test Characteristics. Table II.9 provides data describing this test as
given at Fort Lewis. Note that subjects were reading the instructions more
quickly than they were for Simple Reaction Time and were also finishing the
test more quickly.

Information about whether the same or different hands were used to
respond to all items is not reported in this table. Data on hand use
indicate that 23% of the subjects (N = 26) consistently used the same hand.
The remainder (77%, N = 86) switched from hand to hand at least once to
respond.

Test Scores. Mean values along with standard deviations, ranges, and
reliability estimates are provided in Table II.9. Note that for this
measure, only the two reaction time scores provide reliable information.
(These reliability estimates were calculated using an odd-even procedure
with a Spearman-Brown correction.)

Another measure involves the difference between mean Choice Reaction
Time scores and Simple Reaction Time scores. This value is intended to
capture a speed of processing component. Note that reliability estimates
suggest these values are internally consistent.

Modification  for  Fort  Knox  Field  Test.  No  changes  were  made  to  this 
test  for  the  Fort  Knox  field  test. 


Construct  -  Short-Term  Memory 

This construct is defined as the rate at which one observes, searches,
and recalls information contained in short-term memory.


Memory  Search  Test 

The marker used for this test is a short-term memory search task
introduced by S. Sternberg (1966, 1969). In this test, the subject is
presented with a set of one to five familiar items (e.g., letters); these are
withdrawn and then the subject is presented with a probe item. The subject is
to indicate, as rapidly and as accurately as possible, whether or not the
probe was contained in the original set of items, now held in short-term
memory. Generally, mean reaction time is regressed against the number of
objects in the item or stimulus set. The slope of this function can be
interpreted as the average increase in reaction time with an increase of one
object in the memory set, or the rate at which one can access information in
short-term memory.

The measure developed for computer-administered testing is very similar
to that designed by Sternberg. At the computer console, the subject is
instructed to place his or her hands on the green home buttons. The first
stimulus set then appears on the screen. A stimulus contains one, two,
three, four, or five objects (letters). Following a .5-second or 1.0-second
display period, the stimulus set disappears and, after a delay, the probe
item appears. Presentation of the probe item is delayed by either 2.5
seconds or 3.0 seconds. When the probe appears, the subject must decide
whether or not it appeared in the stimulus set. If the item was present in
the stimulus set, the subject removes his or her hands from the home buttons
and strikes the white key. If the probe item was not present, the subject
strikes the blue key.


Table II.9

Reaction Time Test 2 (Choice Reaction Time):
Fort Lewis Pilot Test


Descriptive Characteristics               Mean      SD          Range

Time to Read Instructions (minutes)       1.01     .36      .45 -   2.37
Time to Complete Test (minutes)            .95     .13      .80 -   1.59
Total Test Time (minutes)                 1.95     .40     1.37 -   3.20
Time-Outs (number per person)                0       0        0 -   1
Invalid Responses (number per person)      .17     .10        0 -   1


Test Scores                               Mean      SD          Range        Rxxᵃ

Mean Decision Timeᵇ                      36.78    7.76    18.75 -  78.29     .94
Mean Total Timeᵇ                         65.98   10.38    37.75 - 117.29     .91
SD - Total Timeᵇ                          8.92    3.75     1.09 -  60.07     .10
Percent Correct                            99%      3%       90 - 100%      -.16


Choice RT Minus Simple RT                 Mean      SD          Range        Rxxᵃ

Decision Timeᵇ                            7.68    0.79   -43.70 -  33.99     .86
Total Timeᵇ                              10.37   11.15   -44.92 -  38.71     .79

ᵃ Rxx = odd-even correlations corrected with the Spearman-Brown formula.

ᵇ Values reported are in hundredths of a second. Statistics are based on
analysis of all 15 items of the test.

Parameters of interest include, first, stimulus set length, or number of
letters in the stimulus set. Values for this parameter range from one to
five. The second parameter, observation period and probe delay period,
includes two levels. The first is described as long observation and short
probe delay; time periods are 1.0 second and 2.5 seconds, respectively. The
second level, short observation and long probe delay, includes periods of .5
second and 3.0 seconds, respectively. The third parameter, probe status,
indicates that the probe is either in the stimulus set or not in the
stimulus set.

Test Characteristics. Table II.10 provides descriptive information for
the Memory Search Test. Subjects had very few time-outs (mean = .17,
SD = .80) and provided about five invalid responses (range 0 - 28).
Overall, total percentage correct is 90. However, the range of
percent-correct values, 44 to 100, indicates that at least one subject was
performing at a lower-than-chance level.

Test Scores. Table II.10 provides information for the total time score,
which was computed and then plotted against item length, defined as the
number of letters in the stimulus set. These plots indicated that decision
and total time produce very similar profiles, whereas movement time results
in a nearly flat profile. Since decision time and total time yield similar
information and movement time appears to serve as a constant, we could have
used either decision or total reaction time to compute scores on this
measure. We elected to use total reaction time.

Subjects receive scores on the following measures:

•  Slope and Intercept - These values are obtained by regressing mean
total reaction time (correct responses only) against item length.
In terms of processing efficiency, the slope represents the average
increase in reaction time with an increase of one object in the
stimulus set; the lower the value, the faster the access. The
intercept represents all other processes not involved in memory
search, such as encoding the probe, determining whether or not a
match has been found, and executing the response.

•  Percent Correct - This value is used to screen subjects completing
the test. For example, in Table II.10 we indicated that one
subject correctly answered 44% of the items. Computing the above
scores for this subject (i.e., slope and intercept) would be
meaningless. Percent-correct scores are used to identify subjects
performing at very low levels, thereby precluding computation of
the above scores.


•  Grand Mean - This value is calculated by first computing the mean
reaction time (correct responses only) for each level of stimulus
set length (i.e., one to five). The mean of these means is then
computed.
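
The three scores listed above can be illustrated with a short sketch. It is
not the scoring program used in the project: the function name, the input
layout (one tuple per trial), and especially the screening cutoff
min_percent_correct are hypothetical, since the report gives no numeric
cutoff. The slope and intercept come from an ordinary least-squares line
fitted to the mean total reaction time at each stimulus set length, using
correct responses only.

```python
from statistics import mean


def memory_search_scores(trials, min_percent_correct=60.0):
    """trials: iterable of (set_length, total_rt, correct) tuples.

    min_percent_correct is a hypothetical screening cutoff, not a value
    taken from the report.
    """
    trials = list(trials)
    percent_correct = 100.0 * sum(1 for _, _, ok in trials if ok) / len(trials)
    scores = {"percent_correct": percent_correct,
              "slope": None, "intercept": None, "grand_mean": None}
    if percent_correct < min_percent_correct:
        return scores  # screened out: slope and intercept would be meaningless

    # Mean total reaction time per set length, correct responses only.
    by_length = {}
    for length, rt, ok in trials:
        if ok:
            by_length.setdefault(length, []).append(rt)
    lengths = sorted(by_length)
    means = [mean(by_length[n]) for n in lengths]
    scores["grand_mean"] = mean(means)          # mean of the per-length means
    if len(lengths) < 2:
        return scores                           # a line needs at least two points

    # Ordinary least-squares line through the (set length, mean RT) points.
    x_bar, y_bar = mean(lengths), mean(means)
    sxx = sum((x - x_bar) ** 2 for x in lengths)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(lengths, means)) / sxx
    scores["slope"] = slope
    scores["intercept"] = y_bar - slope * x_bar
    return scores
```

Fitting the line to the per-length means, rather than to individual trials,
weights each set length equally regardless of how many items were answered
correctly at that length.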


Table II.10 contains the mean, standard deviation, range, and
reliability estimates for each of the scores. Note that all values except
the slope yield fairly high internal consistency.


Table II.10

Memory Search Test: Fort Lewis Pilot Test


Test Characteristics                      Mean      SD          Range

Time to Read Instructions (minutes)       3.06     .76     1.64 -   5.81
Time to Complete Test (minutes)           9.00     .54     8.37 -  11.71
Total Test Time (minutes)                12.07    1.06    10.43 -  17.52
Time-Outs (number per person)              .17     .80        0 -   8
Invalid Responses (number per person)     4.86    4.72        0 -  28


Test Scoresᵃ                              Mean      SD          Range        Rxxᵇ

Slopeᶜ                                    7.19    6.14   -12.70 -  41.53     .54
Mean Total Timeᶜ                         97.53   30.38    44.91 - 230.97     .84
SD - Total Timeᶜ                        119.05   29.84    67.71 - 262.35     .88
Percent Correct                            89%     10%       44 - 100%       .95

ᵃ See text for explanation of these measures.

ᵇ Rxx = odd-even correlation corrected with the Spearman-Brown formula.

ᶜ Values reported are in hundredths of a second. Statistics are based on an
analysis of items answered correctly. (There were 50 items on the test.)


Modifications for Fort Knox Field Test. Results from an analysis of
variance conducted for the three parameters were used to modify this test for
the Fort Knox field test. Total reaction time served as the dependent
variable for this measure. These data indicated that the two levels of
observation period and probe delay yielded no significant differences in
reaction time (F = .27; p < .60). For stimulus set length, levels one to
five, mean reaction time scores differed significantly (F = 84.35;
p < .001). This information confirms results reported in the literature,
which suggest that reaction time increases as stimulus set length increases.
Finally, for probe status, in or not in, mean reaction time scores also
differed significantly (F = 74.24; p < .001). These values indicate that
subjects, on the average, require more time to determine that a probe is not
in the set than to determine that the probe is contained in the set. Results
also indicated a significant interaction between stimulus length and probe
status (F = 7.46; p < .001).

This information was used to modify the Memory Search Test. For
example, stimulus set length had yielded significant mean reaction time score
differences for the five levels. Mean reaction time for levels two and four,
however, differed little from levels three and five, respectively. Thus,
items containing stimulus sets with two and four letters were deleted from
the test file.

Although observation period and probe delay parameters produced
non-significant results, we concluded that different values for probe delay
may provide additional information about processing and memory. For example,
in literature in this area, researchers suggest that subjects begin with a
visual memory of the stimulus objects. After a very brief period, .5 second,
the visual memory begins to decay. To retain a memory of the object set,
subjects shift to an acoustic memory; that is, subjects rehearse the sounds
of the object set and recall its contents acoustically (Thorson, Hochhaus, &
Stanners, 1976). Therefore, we changed the two probe delay periods to .5
second and 2.5 seconds. These periods are designed to assess the two
hypothesized types of short-term memory: visual and acoustic.

Finally, half of the items included in the test were modified to include
unusual or unfamiliar objects (symbols rather than letters). In part, the
rationale for using letters or digits involves using overlearned stimuli so
that novelty of the stimulus does not impact on processing the material. We
elected, however, to include a measure of processing and recalling novel or
unusual material, primarily because Army recruits do encounter and are
required to recall stimuli that are novel to them, especially during their
initial training. Thus, one-half of the test items ask subjects to observe
and recall unfamiliar symbols rather than letters.

The test then, as modified, contains 48 items, one half consisting of
letters and the other half of symbols. Within each item type, three levels
of stimulus length are included. That is, for items with letter stimulus
sets, there are eight items with a single letter, eight with three, and eight
with five letters. The same is done for items containing symbols. Within
each of the stimulus length sets, four items include a .5-second probe delay
and four contain a 2.5-second probe delay period. Across all items (N = 48),
probe status is equally mixed between "in" and "not in" the stimulus set.
With the test so constructed, the effects of stimulus type, stimulus set
length, probe delay period, and probe status can be examined.
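
The factorial layout just described (2 stimulus types x 3 set lengths x 2
probe delays, with four items per cell) can be enumerated to confirm the item
count. The sketch below is illustrative; the names are hypothetical, and the
way probe status is alternated within each cell is an assumption, since the
text states only that probe status is equally mixed across all 48 items.

```python
from itertools import product

STIMULUS_TYPES = ("letters", "symbols")
SET_LENGTHS = (1, 3, 5)          # number of objects in the stimulus set
PROBE_DELAYS = (0.5, 2.5)        # seconds
ITEMS_PER_CELL = 4


def build_item_design():
    """Enumerate the item specifications of the modified Memory Search Test."""
    items = []
    for stim_type, length, delay in product(STIMULUS_TYPES, SET_LENGTHS, PROBE_DELAYS):
        for k in range(ITEMS_PER_CELL):
            items.append({
                "stimulus_type": stim_type,
                "set_length": length,
                "probe_delay": delay,
                # Assumption: "in" and "not in" alternate within each cell.
                "probe_in_set": k % 2 == 0,
            })
    return items
```

Calling build_item_design() yields 48 item specifications: 24 per stimulus
type and 8 per combination of stimulus type and set length.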


Construct - Perceptual Speed and Accuracy


Perceptual speed and accuracy involves the ability to perceive visual
information quickly and accurately and to perform simple processing tasks
with the stimulus (e.g., make comparisons). This requires the ability to
make rapid scanning movements without being distracted by irrelevant visual
stimuli, and measures memory, working speed, and sometimes eye-hand
coordination.


Perceptual  Speed  and  Accuracy 


Measures used as markers for the development of the computerized
Perceptual Speed and Accuracy (PS&A) Test included such tests as the Employee
Aptitude Survey Visual Speed and Accuracy (EAS-4), and the ASVAB Coding Speed
and Tables and Graphs tests. The EAS-4 involves the ability to quickly and
accurately compare numbers and determine whether they are the same or
different, whereas the ASVAB Coding Speed Test measures memory, eye-hand
coordination, and working speed. The Tables and Graphs Test requires the
ability to obtain information quickly and accurately from material presented
in tabular form.


The computer-administered Perceptual Speed and Accuracy Test requires
the ability to make a rapid comparison of two visual stimuli presented
simultaneously and determine whether they are the same or different. Five
different types of stimuli are presented: alpha, numeric, symbolic, mixed,
and word. Within the alpha, numeric, symbolic, and mixed stimuli, the
character length of the stimulus is varied. Four different levels of
stimulus length or "digit" are present: two-digit, five-digit, seven-digit,
and nine-digit. Four items are included in each Type by Digit cell. For
example, four items are two-digit alphas (e.g., XA). In its original form
this test had 80 items.


Reaction times were expected to increase with the number of digits
included in the stimulus. The rationale for including various types of
stimuli was simply that soldiers often encounter various types of stimuli in
military positions.


The subject is instructed to hold the home keys down to begin each item,
release the home keys upon deciding whether the stimuli are the same or
different, and depress a white button if the stimuli are the same or a blue
button if the stimuli are different. The measures obtained are: response
hand, percent correct, total reaction time, decision time, movement time,
time for instructions, and total test time.


Test Characteristics. The computerized Perceptual Speed and Accuracy
Test was administered to 112 individuals at Fort Lewis. Some of the overall
test characteristics are shown in Table II.11.


Table II.11

Overall Characteristics of Perceptual Speed and Accuracy Test:
Fort Lewis Pilot Test


                                          Mean      SD          Range

Time Spent on Instructions (minutes)      2.36     .55     1.37 -   4.30
Time Spent on Test Portion (minutes)      7.82    1.04     5.82 -  12.41
Total Testing Time (minutes)             10.18    1.37     7.45 -  14.88
Time-Outs (number per person)             9.57    6.17        0 -  35
Invalid Responses (number per person)      .94    1.20        0 -   6

Two two-way analyses of variance were performed on reaction times for
correct responses. The first was a Type (4 levels) by Digit (4 levels) ANOVA
of total reaction times. The results showed significant main effects for
Type [F (3,333) = 11.99, p < .001], Digits [F (3,333) = 871.46, p < .001],
and their interaction [F (9,999) = 44.14, p < .001]. The second ANOVA was on
movement times. Pure movement time should be a constant when response hands
are balanced. The results suggested that subjects were still making their
decision about the stimuli after releasing the home keys. That is, the
movement time ANOVA for Type by Digits yielded a significant main effect for
Digits [F (3,333) = 19.94, p < .001]. The interaction of Digits and Type was
also significant [F (9,999) = 7.22, p < .001].

The implications of these results are:

•  Scores should be formed on total reaction times (for correct
responses) instead of decision times because subjects appear to
continue making a decision after releasing the home keys.

•  Means should be computed separately for each set of items with a
particular digit level (i.e., two, five, seven, and nine). Number
of digits had a greater effect on mean reaction time than did
type. Since only correct response reaction times are being used,
subjects could raise their scores on a pooled reaction time by
simply not responding to the nine-digit items. Thus, the mean
reaction times to correct responses for each digit level should be
equally weighted. The grand mean of the mean reaction times for
each digit level was computed.

•  The nine-digit symbolic items were probably too easy. Mean
reaction times for the nine-digit symbolic items were substantially
less than those for the other nine-digit items. Further inspection
of the items showed that some of the items were probably being
processed in "chunks" (e.g., «*■+++*//).

•  Total reaction times for correct responses could be regressed on
digit and intercepts and slopes computed for individuals by means
of repeated measures regression (i.e., the trend appeared to be
linear).
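
The reason for weighting the digit levels equally can be shown with a small
sketch. The function names and the trial layout are hypothetical, and the
decision to raise an error when a digit level has no correct responses is an
assumption; the report does not say how that case was handled. A pooled mean
over all correct responses drops when slow nine-digit items are left
unanswered, whereas the grand mean of the per-level means does not.

```python
from statistics import mean


def pooled_mean_rt(trials):
    """Mean reaction time over all correct responses, ignoring digit level."""
    return mean(rt for _, rt, ok in trials if ok)


def grand_mean_rt(trials, digit_levels=(2, 5, 7, 9)):
    """Equally weighted mean of the per-digit-level mean reaction times.

    trials: iterable of (digits, total_rt, correct) tuples.
    """
    trials = list(trials)
    level_means = []
    for level in digit_levels:
        rts = [rt for digits, rt, ok in trials if ok and digits == level]
        if not rts:
            # Assumption: a missing level is treated as an error, not skipped.
            raise ValueError(f"no correct responses at the {level}-digit level")
        level_means.append(mean(rts))
    return mean(level_means)
```

For example, a subject who answers every item and a subject who answers only
one of the nine-digit items, with otherwise identical times, obtain the same
grand mean but different pooled means.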

Test Scores. As a whole, the scores on the computer-administered
Perceptual Speed and Accuracy Test were quite reliable (see Table II.12).
Reliability coefficients ranged from .85 for the intercept of the regression
of total reaction time on digits to .97 for the grand mean of the mean
reaction times for the four non-word categories and the word category.

Interrelationships Among PS&A Scores. Ideally, efficient performance on
the PS&A Test would produce a low intercept, a low slope, and high accuracy,
combined with a fast grand mean reaction time score. Data analyzed from
the Fort Lewis testing (N = 112) suggest that this relationship may occur
infrequently. As shown in Table II.13, the relationship of the slope with
the intercept is negative. That is, low intercepts tend to correspond with
steep slopes. However, it is possible that individuals who obtained low
intercepts simply had more "room" to increase their reaction times within the
7-second time limit, thus increasing their slope scores. Since high
intercept values were related to slower grand mean reaction times, as well as
less accurate performance, and more "time-outs" occurred on the nine-digit
items, it is likely that the 7-second time limit produced a ceiling effect.

The high positive correlation between the slope and accuracy suggests
that performing accurately is related to a substantial increase in reaction
time as the stimuli increase in length. Steeper slopes also correspond with
slower grand mean reaction times. These slower reaction times were also
related to higher accuracy.


Table II.12

Scores From Perceptual Speed and Accuracy Test:
Fort Lewis Pilot Test


Scoreᵃ                                    Mean      SD          Range        Rxxᵇ

Grand Mean of Mean Reaction Times
  for Non-Word Items                    279.99   57.97    85.67 - 386.49     .97
Mean Reaction Time for Word Items       351.74   68.39   198.64 - 518.64     .91
Grand Mean of Mean Reaction Times
  for Word and Non-Word Items           294.22   57.13   109.34 - 412.75     .97
Intercept                                89.37   36.48    12.99 - 210.34     .85
Slope                                    33.14    9.78     -.75 -  52.11     .89
Percent Correct                          86.9%    8.0%     56.3 - 100%       --

ᵃ Reaction time values are in hundredths of a second and are based on
analysis of items answered correctly. (There were 80 items on the test.)

ᵇ Split halves (odd-even) reliability estimates, Spearman-Brown corrected.

Table 11.13

Intercorrelations Among Perceptual Speed and Accuracy Test Scores:
Fort Lewis Pilot Test

Score               Intercept      Slope          Percent Correct

Slope               [illegible]
Percent Correct     -.26^b         [illegible]
Grand Mean^c         .36^b         [illegible]    [illegible]

^c Grand mean reaction time in this section refers to:

$$\text{Grand Mean} = \frac{\bar{X}_{2\text{-digits}} + \bar{X}_{5\text{-digits}} + \bar{X}_{7\text{-digits}} + \bar{X}_{9\text{-digits}} + \bar{X}_{\text{words}}}{5}$$

Modifications for Fort Knox Field Test. Several changes were made to
this test. A reduction in the number of items was desirable in order to cut
down the testing time, and the reliability of the test scores indicated that
the test length could be considerably reduced without causing the
reliabilities to fall below acceptable levels (see Table 11.12).

Item deletion was accomplished in two ways. First, all the seven-digit
items were deleted (16 items). Such deletions should have little effect on
the test scores, since the relationship between number of digits and reaction
time is linear, and the items containing two, five, and nine digits should
provide sufficient data points.

Second, four items were deleted from each of the remaining three digit
categories (two, five, and nine) and from the "word" items. Thus, 16 more
items were deleted. The following factors were considered in selecting
items for deletion:

•  Item intercorrelations within stimulus type and digit size were
   examined. In many cases, one item did not correlate highly with
   the others. Items that produced the lowest intercorrelations were
   deleted. Use of this criterion resulted in 13 item deletions.

•  When item intercorrelations did not differ substantially, accuracy
   rates and variances were reviewed. These factors did not indicate
   any clear candidates for deletion.

•  When all the above were approximately equal, the decision to delete
   an item was based on its correct response (i.e., "same" or
   "different"). The item that would have caused an imbalance
   between the responses (if retained) was deleted. This was, in
   effect, a random selection.

Several other changes were made, either to correct perceived shortcomings
or to otherwise improve the test. The symbolic, nine-digit items
were modified to make them more difficult. (As previously noted, these
items had originally been developed in such a way that the symbols were in
"chunks," thus making the items, in effect, much shorter than the intended
nine digits; these grouped symbols were broken up.) Five items were changed so
that the correct response was "different" rather than "same" in order to
balance type of correct response within digit level. Finally, the time
allowed to make a response to an item was increased from 7 seconds to 9
seconds in order to give subjects sufficient time to respond, especially for
the more difficult items.

The revised test, then, contained 48 items; 36 were divided into 12 Type
(alpha, numeric, symbolic, mixed) by Number of Digits (two, five, nine)
cells, and 12 were "word" items.
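
The item counts above can be checked with a short calculation. The sketch
below only restates figures given in the text (80 original items, 16
seven-digit items deleted, four items deleted from each remaining category);
the variable names are illustrative.

```python
STIMULUS_TYPES = ["alpha", "numeric", "symbolic", "mixed"]
REVISED_DIGIT_LEVELS = [2, 5, 9]

original_total = 80          # from the footnote to Table 11.12
seven_digit_deleted = 16     # all seven-digit items were dropped

# Four items were dropped from each of the three remaining digit
# categories and from the "word" items.
per_category_deleted = 4 * (len(REVISED_DIGIT_LEVELS) + 1)

revised_total = original_total - seven_digit_deleted - per_category_deleted

cells = len(STIMULUS_TYPES) * len(REVISED_DIGIT_LEVELS)  # Type x Number of Digits
non_word_items = 36
word_items = 12
items_per_cell = non_word_items // cells

print(revised_total, cells, items_per_cell)
```

The calculation gives 48 revised items, 12 Type by Number of Digits cells, and
3 items per cell, which agrees with the counts stated in the text.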


Target  Identification  Test 

The Target Identification Test is a measure of the perceptual speed and
accuracy construct. The objects perceived are meaningful figures, however,
rather than being made up of numbers, letters, or symbols. In this test,
each item shows a target object near the top of the screen and three labeled
stimuli in a row near the bottom of the screen. Examples are shown in Figure
11.16. The subject is to identify which of three stimuli represents the same
object as the target and to press, as quickly as possible, the button (blue,
yellow, or white) that corresponds to that object.

The objects shown are based on military vehicles and aircraft as shown
on the standard set of flashcards used to train soldiers to recognize
equipment presently being used by various nations. Twenty-two drawings of
objects were prepared.

Several parameters were varied in the stimulus presentation. In addition
to type of object, a second parameter was the position of the correct
response: on the left, in the middle, or on the right side of the screen. A
third was the orientation of the target object, that is, whether the object
is "facing" in the same direction as the stimuli or in the opposite direction.

A fourth parameter was the angle of rotation (from horizontal) of the
target object. Seven different angular rotations were used for the Fort
Lewis administration: 0°, 20°, 25°, 30°, 35°, 40°, and 45°. The fifth
parameter was the size of the target object. Ten different levels of size
reduction were used in the Fort Lewis administration: 40%, 50%, 55%, 60%,
65%, 75%, 80%, 85%, 90%, and 100%. Fifty percent reduction means that the
target object was half the size of the stimulus objects at the bottom of the
screen; 100% is full size.

There was no intention of creating a test that had items tapping each
cell of a crossed design for these five parameters. Instead, we viewed this
tryout of the test as an opportunity to explore a number of different factors
that could conceivably affect test performance. A total of 44 items were
included on the test.

Test Characteristics. Table 11.14 shows data from the Fort Lewis pilot
test of the Target Identification Test. The lower part of the table shows
data from the two measures of concern: total reaction time and percent
correct. The test was conceived as a speeded test, in the sense that each item
could be answered correctly if the subject took sufficient time to study the
items and, therefore, the reaction time measure was intended to show the most
variance. The data show that these intentions were achieved.

Score Variables. As noted above, the primary scores for this test were
total reaction time (includes both decision and movement times) for correct
responses, and the percent of responses that were correct. Total reaction
time was used rather than decision time because it seems to be more
ecologically valid (i.e., the Army is interested in how quickly a soldier can
perceive, decide, and take some action and not just in the decision time).
Also, analyses of variance showed similar results for the two measures.
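
A minimal sketch of these two scores is shown below. It assumes that each
response is recorded with a decision time, a movement time, and a correctness
flag, and that the reaction time score is the mean over correct responses;
the `Response` record and the `score_test` function are hypothetical names
introduced only for this example.

```python
from dataclasses import dataclass


@dataclass
class Response:
    decision_time: float   # hundredths of a second (assumed unit)
    movement_time: float   # hundredths of a second (assumed unit)
    correct: bool


def score_test(responses):
    """Return (mean total reaction time for correct responses, percent correct)."""
    correct = [r for r in responses if r.correct]
    percent_correct = 100.0 * len(correct) / len(responses)
    if not correct:
        return None, percent_correct
    # Total reaction time combines decision and movement time.
    total_times = [r.decision_time + r.movement_time for r in correct]
    return sum(total_times) / len(total_times), percent_correct
```

Incorrect responses contribute to percent correct but are excluded from the
reaction time score, as stated in the text.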

Modifications for the Fort Knox Field Test. The revised test consisted
of 48 items instead of 44. Two parameters of the test were left unchanged:
position of the object that "matched" the target and direction in which the
target object faced. They were retained even though analyses of the Fort
Lewis data indicated that opposite-facing targets appeared to be more
difficult and the middle position of the correct object was slightly "easier."

Table 11.14

Target Identification Test: Fort Lewis Pilot Test

Descriptive Characteristics             Mean     SD      Range

Time to Read Instructions (minutes)     2.01     1.04    1.10 -  9.21
Time to Complete Test (minutes)         3.61     0.45    2.96 -  5.46
Total Test Time (minutes)               5.62     1.23    4.12 - 12.81
Time-Outs (per person)                   .06      .28    0    -  2
Invalid Responses (per person)          3.20     3.61    0    - 29

Test Scores                 Mean      SD       Range              Rxx^a

Total Reaction Time^b       218.51    68.75    113.10 - 492.95    .97
Percent Correct              92.6%     8.3%     34.1  - 100%      .78

^a Reliability estimates computed using odd-even procedure with
Spearman-Brown correction.

^b In hundredths of a second.


Three parameters were changed. The objects to be matched with the
target were made to be all from one type (helicopters or aircraft or tanks,
etc.) or from two types, rather than from one, two, or three. This was done
because analyses showed the "three-type" items to be extremely easy.
Rotation angles were reduced from seven levels to just two, 0° and 45°, since
analyses showed that angular rotations near 0° had very little effect on
reaction time.

Finally, the size parameter was radically changed. The target object
was either 50% of the size of the stimulus objects, or was made to "move."
The "moving" items were made to appear initially on the screen as a very small
dot, completely indistinguishable, and then to quickly and successively
disappear and reappear, slightly enlarged in size and slightly to the left (or
right, depending on the side of the screen where the target initially
appeared) of the prior appearance. Thus, the subject had to observe the moving
and enlarging target until certain of matching it to one of the stimulus
objects. These "moving" items were thought to represent greater ecological or
content validity, but still to be a part of the figural perception construct.

Construct - Psychomotor Precision


This construct reflects the ability to make muscular movements necessary
to adjust or position a machine control mechanism. This ability applies to
both anticipatory movements (i.e., where the subject must respond to a
stimulus condition that is continuously changing in an unpredictable manner)
and controlled movements (i.e., where the subject must respond to a stimulus
condition that is changing in a predictable fashion, or making only a
relatively few discrete, unpredictable changes). Psychomotor precision thus
encompasses two of the ability constructs identified by Fleishman and his
associates, control precision and rate control (Fleishman, 1967).

Performance on tracking tasks is very likely related to psychomotor
precision. Since tracking tasks are an important part of many Army MOS,
development of psychomotor precision tests was made a high priority. The
Fort Lewis computer battery included two measures of this ability.


Target  Tracking  Test  1 

Target Tracking Test 1 was designed to measure subjects' ability to make
fine, highly controlled movements to adjust a machine control mechanism in
response to a stimulus whose speed and direction of movement are perfectly
predictable. Fleishman labeled this ability control precision. The Rotary
Pursuit Test (Melton, 1947) served as a model for Target Tracking Test 1.

For each trial of this pursuit tracking test, subjects are shown a path
consisting entirely of vertical and horizontal line segments. At the
beginning of the path is a target box, and centered in the box are
crosshairs. As the trial begins, the target starts to move along the path at
a constant rate of speed. The subject's task is to keep the crosshairs
centered within the target at all times. The subject uses a joystick,
controlled with one hand, to control movement of the crosshairs.

Several item parameters vary from trial to trial. These include the
speed of the crosshairs, the maximum speed of the target, the difference
between crosshairs and target speeds, the total length of the path, the
number of line segments comprising the path, and the average amount of time
the target spends traveling along each segment. Obviously, these parameters
are not all independent; for example, crosshairs speed and maximum target
speed determine the difference between crosshairs and target speeds.

For the Fort Lewis battery, subjects were given 18 test trials. Three
of the 18 paths were duplicates (the paths for trials 15-17 were identical to
the paths for trials 1, 2, and 7). Ignoring these duplicates, the test was
constructed so that the trials at the beginning of the test were easier than
trials at the end of the test.

Test Characteristics. Table 11.15 presents data for Target Tracking
Test 1 based on the Fort Lewis pilot test. The 18 trials of the test
required 9.07 minutes to complete. Since all subjects received the same set
of paths, there was virtually no variability in completion time.


Table 11.15

Target Tracking Test 1: Fort Lewis Pilot Test

Descriptive Characteristics             Mean     SD      Range

Time to Read Instructions (minutes)      1.20    .43      .33 -  3.09
Time to Complete Test (minutes)          9.07    .02     9.05 -  9.12
Total Test Time (minutes)               10.27    .43     9.42 - 12.17

Test Scores        Mean     SD      Range          Rxx^a

Distance^b         1.44     .45     .95 - 3.40     .97

^a Spearman-Brown corrected split-half reliability for odd-even trials.

^b Square root of the average within-trial distance (horizontal pixels) from
the center of the target to the center of the crosshairs, averaged across
all 18 trials (or items) on the test.


Test Scores. Two classes of measures were investigated: (a) tracking
accuracy and (b) improvement in tracking performance, based on the three
duplicate paths included in the test. Two tracking accuracy measures were
investigated, time on target and distance from the center of crosshairs to
the center of the target. Kelley (1969) demonstrated that distance is a more
reliable measure of tracking performance than time on target. Therefore, the
test program computes the distance³ from the crosshairs to the center of
the target several times each second, and then averages these distances to
derive an overall accuracy score for that trial.


Subsequently, when the distribution of subjects' scores on each trial
was examined, it was found that the distribution was highly positively
skewed. Consequently, the trial score was transformed by taking the square
root of the average distance. As a result, the distribution of subjects'
scores on each trial was more nearly normal. These trial scores were then
averaged to determine an overall tracking accuracy score for each subject.
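
The scoring steps just described can be sketched as follows. The sketch
assumes that distance is Euclidean and that a vertical offset is converted to
horizontal pixel units using the 3-to-1 ratio given in the report's footnote
on the Compaq screen; the function names and the layout of the sampled
positions are hypothetical.

```python
import math

# One vertical pixel is treated as equivalent to 3 horizontal pixels.
VERTICAL_TO_HORIZONTAL = 3


def sample_distance(crosshairs, target):
    """Distance in horizontal pixel units between two (x, y) screen positions.

    Euclidean distance is an assumption; the text says only that distances
    were computed in horizontal pixel units.
    """
    dx = crosshairs[0] - target[0]
    dy = (crosshairs[1] - target[1]) * VERTICAL_TO_HORIZONTAL
    return math.hypot(dx, dy)


def trial_score(samples):
    """samples: list of (crosshairs, target) position pairs taken during one trial."""
    distances = [sample_distance(c, t) for c, t in samples]
    average = sum(distances) / len(distances)
    return math.sqrt(average)  # square root reduces the positive skew


def overall_score(trials):
    """Average the square-root trial scores across all trials for one subject."""
    scores = [trial_score(samples) for samples in trials]
    return sum(scores) / len(scores)
```

Lower scores indicate that the crosshairs stayed closer to the center of the
target.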


³The Compaq video screen is divided into 200 pixels vertically and 640
pixels horizontally, with each vertical pixel equivalent to 3 horizontal
pixels. All distance measures were computed in horizontal pixel units.


Prior to the Fort Lewis test, it was expected that subjects' tracking
proficiency would improve considerably over the course of the test. That was
one of the reasons that initial test trials were designed to be easier than
final test trials. However, analyses of the Fort Lewis data revealed that
subjects' performance on trials 1, 2, and 7 actually differed little from
their performance on trials 15-17. Therefore, it was decided that no measure
of improvement in tracking performance would be computed.


The internal consistency reliability of the accuracy score was computed
by comparing mean accuracy scores for odd and even trials. The
Spearman-Brown corrected reliability was .97.


Four one-way analyses of variance were executed to determine how
tracking accuracy was affected by average segment length, average time
required for the target to travel a segment, maximum crosshairs speed, and
difference between maximum crosshairs speed and target speed. All four item
parameters were significantly related to accuracy score, with crosshairs
speed accounting for the most variance and difference between target and
crosshairs speed the least. All four parameters were highly intercorrelated.


Modifications for the Fort Knox Field Test. Several changes were made
in the paths comprising this test for the Fort Knox field test. First, all
paths were modified so that each would run for the same amount of time
(approximately .36 minute). The primary reason for this change was that the
program computes distance between the crosshairs and target a set number of
times each second. If all paths run the same amount of time, then the
accuracy measure for each trial will be based on the same number of distance
assessments.

Second, three item parameters were identified to direct the format of
test trials: maximum crosshairs speed, difference between maximum crosshairs
speed and target speed, and number of path segments. Given these parameters
and the constraint that all trials run a fixed amount of time, the values of
all other item parameters (e.g., target speed, total length of the path) can
be determined. Three levels were identified for each of the three
parameters. These were completely crossed to create a 27-item test. Items were
then randomly ordered. These procedures for item development should
alleviate previous problems interpreting test results in light of correlated
item parameters.
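
The crossed design described above can be sketched in a few lines. The level
labels, the random seed, and the function names are hypothetical; the actual
parameter values are not given here. The second function illustrates how the
remaining item parameters follow from the three design parameters under a
fixed trial time, assuming the speed difference is crosshairs speed minus
target speed and that the target moves at a constant speed.

```python
import itertools
import random

PARAMETERS = ("max_crosshairs_speed", "speed_difference", "n_path_segments")
LEVELS = ("low", "medium", "high")  # hypothetical labels for the three levels


def build_items(seed=0):
    """Cross three levels of three parameters (3 x 3 x 3) and shuffle the order."""
    items = [
        dict(zip(PARAMETERS, combo))
        for combo in itertools.product(LEVELS, repeat=len(PARAMETERS))
    ]
    random.Random(seed).shuffle(items)
    return items


def derived_values(max_crosshairs_speed, speed_difference, trial_seconds):
    """Target speed and path length implied by the design parameters.

    Assumes difference = crosshairs speed - target speed (an assumption
    about sign) and a constant target speed over the whole trial.
    """
    target_speed = max_crosshairs_speed - speed_difference
    path_length = target_speed * trial_seconds
    return target_speed, path_length


items = build_items()
print(len(items))
```

Because every combination of levels appears exactly once, the three design
parameters are uncorrelated across items, which is the property the revised
design was intended to provide.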

Third, in spite of these changes, which added 50% more trials to the
test, testing time was actually reduced slightly (25 seconds less, it was
estimated), because of the standardization of trial time.

Target  Shoot  Test 

The Target Shoot Test was modeled after several compensatory and pursuit
tracking tests used by the AAF in the Aviation Psychology Program (e.g., the
Rate Control Test). The distinguishing feature of these tests is that the
target stimulus moves with continuously changing and unpredictable speed and
direction. Thus, the subject must attempt to anticipate these changes and
respond accordingly.

For the Target Shoot Test, a target box and a crosshairs appear in
different locations on the computer screen. The target moves about the
screen in an unpredictable manner, frequently changing speed and direction.
The subject controls movement of the crosshairs via a joystick. The
subject's task is to move the crosshairs into the center of the target, and
when this has been accomplished, to press a button on the response pedestal
to "fire" at the target. The subject's score on a trial is the distance from
the center of the crosshairs to the center of the target at the time the
subject fires. The test consists of 40 trials.

Several item parameters were varied from trial to trial. These
parameters included the maximum speed of the crosshairs, the average speed of
the target, the difference between crosshairs and target speeds, the number
of changes in target speed (if any), the number of line segments comprising
the path of each target, and the average amount of time required for the
target to travel each segment. These parameters are not all independent, of
course. Moreover, the nature of the test creates a problem in characterizing
some trials since a trial terminates as soon as the subject fires at the
target. Thus, one subject may see only a fraction of the line segments,
target speeds, etc. that another subject sees.

Test Characteristics. Table 11.16 presents data based on the Fort Lewis
pilot test.

Test Scores. Three measures were obtained for each trial. Two were
measures of firing accuracy: (a) the distance from the center of the
crosshairs to the center of the target at the time of firing, and (b) whether
the subject "hit" or "missed" the target. The two were very highly
correlated, though the former provides quite a bit more information about
firing accuracy than the latter. Therefore, distance was retained as the
accuracy measure; distances were averaged across trials to obtain an overall
accuracy score. The third measure was a speed measure which represented the
time from trial onset until the subject fired at the target.
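
A minimal sketch of these three measures is given below. It follows the
footnote to Table 11.16 in taking the square root of each firing distance and
averaging only over trials in which the subject fired. Counting percent of
hits over all trials, and the dictionary keys used for each trial record, are
assumptions made for this example.

```python
import math


def target_shoot_scores(trials):
    """trials: list of dicts with keys 'distance', 'hit', and 'time_to_fire'.

    'distance' and 'time_to_fire' are None when the subject never fired on
    that trial (hypothetical record layout).
    """
    fired = [t for t in trials if t["distance"] is not None]
    # Assumption: percent of hits is taken over all trials, so a subject
    # who never fires scores 0 percent.
    percent_hits = 100.0 * sum(1 for t in trials if t["hit"]) / len(trials)
    if not fired:
        return None, percent_hits, None
    distance_score = sum(math.sqrt(t["distance"]) for t in fired) / len(fired)
    mean_time = sum(t["time_to_fire"] for t in fired) / len(fired)
    return distance_score, percent_hits, mean_time
```

Trials without firing lower the percent of hits but leave the distance and
speed scores unaffected under these assumptions.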

Split-half  reliability  across  odd-even  trials  was  computed  for  the  two 
accuracy  measures. 

Changes for Fort Knox. The test was not modified for the Fort Knox
administration.


Construct - Multilimb Coordination


The multilimb coordination construct reflects the ability to coordinate
the simultaneous movement of two or more limbs. This ability is general to
tasks requiring coordination of any two limbs (e.g., two hands, two feet, one
hand and one foot). The ability does not apply to tasks in which trunk
movement must be integrated with limb movements. It is most common in tasks
where the body is at rest (e.g., seated or standing) while two or more limbs
are in motion.




In the past, measures of multilimb coordination have shown quite high
validity for predicting job and training performance, especially for pilots
(Melton, 1947).

Table 11.16


Target Shoot Test: Fort Lewis Pilot Test

Descriptive Characteristics             Mean    SD      Range

Time to Read Instructions (minutes)     1.58     .61     .51 -  5.10
Time to Complete Test (minutes)         2.22     .23    [illegible]
Total Test Time (minutes)               3.80     .68    2.71 -  7.58
No. of Trials Without Firing^a          2.77    3.97    0    - 40

Test Scores           Mean    SD             Range          Rxx^b

Distance^c            2.83    [illegible]    1.93 - 7.03    [illegible]
Percent of Hits^a     58      13             0    - 83      [illegible]

^a One subject failed to fire at any targets. Excluding this subject, mean,
SD, and range for number of trials without firing were 2.43, 1.78, and 0-8,
respectively; mean, SD, and range for percent of hits were 59, 12, and 13-83,
respectively.

^b Spearman-Brown corrected split-half reliability for odd-even trials.

^c Square root of the distance (horizontal pixels) from the center of the
target to the center of the crosshairs at the time of firing, averaged across
all trials in which the subject fired at the target. (There were a total of
40 trials or items on the test.)


Target  Tracking  Test  2 


Target Tracking Test 2 is modeled after a test of multilimb coordination
developed by the AAF, the Two-Hand Coordination Test, which required subjects
to perform a pursuit tracking task. Horizontal and vertical movements of the
target-follower were controlled by two handles. Validity estimates of this
test for predicting AAF pilot training success were mostly in the .30s.

Target Tracking Test 2 is very similar to the Two-Hand Coordination
Test. For each trial subjects are shown a path consisting entirely of
vertical and horizontal lines. At the beginning of the path is a target box,
and centered in the box are crosshairs. As the trial begins, the target
starts to move along the path at a constant rate of speed. The subject
manipulates two sliding resistors to control movement of the crosshairs. One
resistor controls movement in the horizontal plane, the other in the vertical
plane. The subject's task is to keep the crosshairs centered within the
target at all times.

This test and Target Tracking Test 1 are virtually identical except for
the nature of the required control manipulation. For Target Tracking Test 1
crosshairs movement is controlled via a joystick, while for Target Tracking
Test 2 crosshairs movement is controlled via the two sliding resistors. For
the Fort Lewis battery, the same 18 paths were used in both tests, and the
value of the crosshairs and target speed parameters was the same in both sets
of trials. The only other difference between the two tests was that subjects
were permitted three practice trials for Target Tracking Test 2.

Test Characteristics. The descriptive data are shown in Table 11.17.


Table 11.17

Target Tracking Test 2: Fort Lewis Pilot Test

Descriptive Characteristics             Mean     SD      Range

Time to Read Instructions (minutes)      3.58    .68      2.39 -  6.38
Time to Complete Test (minutes)          9.09    .02      9.03 -  9.13
Total Test Time (minutes)               12.67    .68     11.50 - 15.48

Test Scores        Mean     SD      Range        Rxx^a

Distance^b         2.02     .64     0 - 4.01     .97

^a Spearman-Brown corrected split-half reliability for odd-even trials.

^b Square root of the distance (horizontal pixels) from the center of the
target to the center of the crosshairs, averaged across all 18 trials (or
items) on the test.


Test Scores. The same score was used for this test as for Target Tracking
Test 1; that is, the square root of the average within-trial distance from the
center of the crosshairs to the center of the target, averaged across all
trials.


Four one-way analyses of variance were executed to determine the effects
of average segment length, average time required for the target to travel a
segment, maximum crosshairs speed, and difference between maximum crosshairs
speed and target speed on tracking accuracy. All four item parameters were
significantly related to accuracy score, with crosshairs speed accounting for
the most variance and average segment length for the least. It should be
noted again that all four parameters were highly intercorrelated.

Modifications for the Fort Knox Field Test. Changes in Target Tracking
Test 2 for Fort Knox mirrored those made for Target Tracking Test 1. Test
trials were changed completely, and the number of items was increased from 18
to 27. However, the items are not the same as those presented for Target
Tracking Test 1. This was expected to reduce the correlations between these
tests to some extent.


Construct  -  Number  Operations 

This construct involves the ability to perform, quickly and accurately,
simple arithmetic operations such as addition, subtraction, multiplication,
and division.

The current ASVAB includes a numerical operations test containing 50
very simple arithmetic problems with a 3-minute time limit. Because of low
item difficulty and the speeded nature of the test, correlations with other
ASVAB subtests indicate that Numerical Operations is most strongly related to
Coding Speed, a measure of perceptual speed and accuracy. The present
military-wide selection and classification battery, then, measures very basic
number operations abilities which appear very similar to perceptual speed and
accuracy abilities.

The test designed to assess number operations abilities was not
completed prior to the Fort Lewis pilot test. Therefore, no data were available
to evaluate this measure prior to field testing.


Number  Memory  Test 

This test was modeled after a number memory test developed by Dr.
Raymond Christal at AFHRL. The basic difference between the AFHRL test and
the Number Memory Test concerns pacing of the number items. The former uses
machine-paced presentation, while the latter is self-paced. Both, however,
require subjects to perform simple number operations such as addition,
subtraction, multiplication, and division, and both involve a memory task.

In the Number Memory Test, subjects are presented with a single number
on the computer screen. After studying the number, the subject is instructed
to push a button to receive the next part of the problem. When the subject
presses the button, the first part of the problem disappears and another
number, along with an operation term such as "Add 9" or "Subtract 6," then
appears. Once the subject has combined the first number with the second,
he or she must press a button to receive the third part of the problem.
Again, the second part of the problem disappears when the subject presses the
button. This procedure continues until a solution to the problem is
presented. The subject must then indicate whether the solution presented is
true or false. An example number operation item appears below.


Item Set      Probe                          Response

+6            Is 16 the correct answer?      T        F
-3                                           White    Blue
x2
-4


Test items vary with respect to the number of parts (four, six, or eight)
contained in the single item. Items also vary according to the delay between
item part presentation or interstimulus delay period. One-half of the items
include a brief delay (.5 second) while the other half contain a lengthier
delay (2.5 seconds). The test contained 27 items.
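
The logic of a Number Memory item can be sketched as a running total that is
updated one part at a time and then compared with the probe. In the sketch
below the operation parts are those of the example item; the starting numbers
(7 and 5) are hypothetical values chosen only for illustration, since the
first number of the example is not shown, and the function names are likewise
illustrative.

```python
import operator

# Operation symbols as they appear in the example item; "/" is included
# because the text mentions division.
OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "x": operator.mul,
    "/": operator.truediv,
}


def evaluate_item(start, parts):
    """Apply each (symbol, operand) part in order to the running total."""
    total = start
    for symbol, operand in parts:
        total = OPERATIONS[symbol](total, operand)
    return total


def probe_is_true(start, parts, probe):
    """True if the probe value matches the result of working through the item."""
    return evaluate_item(start, parts) == probe


example_parts = [("+", 6), ("-", 3), ("x", 2), ("-", 4)]
print(probe_is_true(7, example_parts, 16))  # hypothetical starting number 7
print(probe_is_true(5, example_parts, 16))  # hypothetical starting number 5
```

With a starting number of 7 the running totals are 13, 10, 20, and 16, so the
probe "Is 16 the correct answer?" would be true; with a starting number of 5
the final total is 12 and the probe would be false. The subject must hold each
running total in memory because earlier parts disappear from the screen.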

This test is not a "pure" measure of number operations, since it also is
designed to bring short-term memory into play.

As noted, the test was not administered at Fort Lewis. Analyses
planned for data from the Fort Knox field test administration included the
impact of item length (four, six, or eight) and interstimulus delay (.5
second or 2.5 seconds) on reaction time and percent correct, as well as
comparisons of mean reaction time scores for item parts requiring addition,
subtraction, multiplication, and division. These data will be used to
identify the measures for scoring subject responses.


Construct  -  Movement  Judgment 

Movement judgment is the ability to judge the relative speed and
direction of one or more moving objects in order to determine where those
objects will be at a given point in time and/or when those objects might
intersect.

Movement judgment was not one of the constructs identified and targeted
for test development by the literature review or expert judgments. However,
a suggestion by Dr. Lloyd Humphreys, one of Project A's scientific advisors,
and job observations we conducted at Fort Stewart, Fort Bragg, Fort Bliss,
Fort Sill, and Fort Knox led us to conclude that movement judgment is
potentially important for job performance in a number of combat MOS.


Cannon  Shoot  Test 

As part of its Aviation Psychology Program, the AAF became interested in
motion, distance, and orientation judgment and instituted development of a
battery of motion picture and photograph tests (Gibson, 1947). One of these
tests was the Estimate of Relative Velocities Test, a paper-and-pencil
measure. Each trial consisted of four frames. In each frame, two airplanes
were shown flying along the same path in the same direction. In each
subsequent frame, the trailing plane edged nearer the lead plane. The subject's
task was to indicate on the final frame where the planes would intersect.
Validities of this test for predicting pilot training success averaged
approximately .16 (Gibson, 1947). The present test was designed to test the
construct that seems to underlie the Estimate of Relative Velocities Test.

The Cannon Shoot Test measures subjects' ability to fire at a moving
target in such a way that the shell hits the target when the target crosses
the cannon's line of fire. At the beginning of each trial, a stationary
cannon appears on the video screen. The starting position of this cannon
varies from trial to trial. The cannon is "capable" of firing a shell, which
travels at a constant speed on each trial. Shortly after the cannon appears,
a circular target moves onto the screen. This target moves in a constant
direction at a constant rate of speed throughout the trial, though the speed
and direction vary from trial to trial. The subject's task is to push a
response button to fire the shell so that the shell intersects the target
when the target crosses the shell's line of fire.

Three parameters determine the nature of each test trial. The first is
the angle of the target movement relative to the position of the cannon; 12
different angles were used. The second is the distance from the cannon to
the impact point (i.e., the point at which the target crosses the cannon's
line of fire); four different distance values were used. The third parameter
is the distance from the impact point to the fire point (i.e., the point at
which the subject must fire the shell in order to hit the center of the
target); there were also four values for this distance parameter. The last
two parameters determine the speed of the target--that is, given a fixed
shell speed, impact point, and fire point, the speed of the target is
established.
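
The relation among these quantities can be written out as a short Python sketch. It assumes only what the paragraph states: the shell and the target must reach the impact point at the same moment, so the target has to cover the fire-point-to-impact-point distance in the time the shell needs to travel from the cannon to the impact point. The function name and the numeric values are illustrative assumptions, not values from the test.

```python
def target_speed(shell_speed: float,
                 cannon_to_impact: float,
                 fire_to_impact: float) -> float:
    """Speed the target must have for a perfectly timed shot to hit its center.

    shell_speed      -- constant speed of the shell (distance units per second)
    cannon_to_impact -- distance from the cannon to the impact point
    fire_to_impact   -- distance from the fire point to the impact point
    """
    # Time the shell needs to travel from the cannon to the impact point.
    shell_flight_time = cannon_to_impact / shell_speed
    # The target must cover its remaining distance in exactly that time.
    return fire_to_impact / shell_flight_time


# Hypothetical values in arbitrary screen units.
print(target_speed(shell_speed=10.0, cannon_to_impact=20.0, fire_to_impact=6.0))
```

With the shell speed held fixed, each pairing of the two distance parameters therefore yields one target speed.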

If a completely crossed design had been used, it would have necessitated
a minimum of 192 trials (i.e., 12x4x4=192). Instead, a Latin square design
was employed, so the current version of the test includes only 48 trials.
Three measures are assessed on each trial: (a) whether the shell hits or
misses the target; (b) the distance from the shell to the center of the
target at the time the target crosses the impact point; and (c) the distance
from the center of the target to the fire point at the time the shell is
fired. The Fort Knox field test data will be analyzed to determine which of
these three measures is most reliable. This test was not administered at
Fort Lewis.
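
The reduction from 192 to 48 trials can be illustrated with a small Python sketch. The report does not give the actual trial layout, so the cyclic assignment below is an assumption chosen only to show the idea: every angle is paired with every impact-point distance (12 x 4 = 48 trials), and the fire-point distance level is rotated so that each level still occurs equally often with each level of the other two factors.

```python
from collections import Counter
from itertools import product

N_ANGLES = 12   # angles of target movement
N_IMPACT = 4    # cannon-to-impact-point distances
N_FIRE = 4      # impact-point-to-fire-point distances


def full_crossing():
    """Every combination of the three parameters (completely crossed design)."""
    return list(product(range(N_ANGLES), range(N_IMPACT), range(N_FIRE)))


def latin_square_trials():
    """One trial per (angle, impact) cell; fire level assigned cyclically.

    The cyclic rule is a hypothetical layout, not the one used in the test.
    """
    trials = []
    for angle, impact in product(range(N_ANGLES), range(N_IMPACT)):
        fire = (angle + impact) % N_FIRE
        trials.append((angle, impact, fire))
    return trials


trials = latin_square_trials()
print(len(full_crossing()), len(trials))
# How often each (impact, fire) pairing occurs in the reduced design.
print(Counter((impact, fire) for _, impact, fire in trials))
```

Under this layout each fire-point distance appears 12 times, once with every angle, and three times with every impact-point distance, so main effects remain estimable with one quarter of the trials.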


Summary  of  Pilot  Test  Results  for 
Computer-Administered  Tests 

Table 11.18 shows the means, standard deviations, and split-half
reliabilities for 24 scores computed from eight computer tests administered
at the Fort Lewis pilot test, and Table 11.19 shows the intercorrelations
between computer test scores. Table 11.20 shows the correlations between
computer-administered test scores and cognitive paper-and-pencil test scores.
Table 11.18

Means, Standard Deviations, and Split-Half Reliability Coefficients for
24 Computer Measure Scores Based on Fort Lewis Pilot Test Data (N = 112)

                                                             Split-Half
Measure                                    Mean       SD     Reliability(a)

SIMPLE REACTION TIME (10 Items)
  Mean Decision Time (hs)(b)              29.25     8.10        .92
  Mean Total Reaction Time (hs)           55.92    13.86        .94
  Trimmed Standard Deviation (hs)         11.79    16.80        .66
  Percent Correct                         99        3          -.01

CHOICE REACTION TIME (IB Items)
  Mean Decision Time (hs)                 36.78     7.75        .94
  Mean Total Reaction Time (hs)           65.98    10.39        .91
  Standard Deviation (hs)                  8.92     3.75        .10
  Percent Correct                         99        3          -.16

DIFFERENCE IN SIMPLE & CHOICE REACTION TIME
  Decision Time (hs)                       7.68     8.79        .86
  Total Time (hs)                         10.37    11.15        .79

SHORT-TERM MEMORY (50 Items)
  Intercept (hs)                          97.53    30.28        .84
  Slope (hs)                               7.19     6.14        .54
  Percent Correct                         90       10           .95
  Grand Mean (hs)                        119.05    29.84        .88

PERCEPTUAL SPEED & ACCURACY (80 Items)
  Intercept (hs)                          89.37    36.48        .85
  Slope (hs)                              33.14     9.78        .89
  Percent Correct                         87        8           .81
  Grand Mean (hs)                        294.22    57.13        .97

TARGET IDENTIFICATION (44 Items)
  Mean Total Time (hs)                   218.51    68.75        .97
  Percent Correct                         93        8           .78

TARGET TRACKING 1 (18 Items)
  Mean Distance (m √m pixels)(c)           1.44      .45        .97

TARGET TRACKING 2 (18 Items)
  Mean Distance (m √m pixels)              2.01      .64        .97

TARGET SHOOT (40 Items)
  Mean Total Distance (m √m pixels)        2.83      .52        .93
  Percent "Hits"                          58       13           .78

a Odd-even item correlation corrected to full test length with the
  Spearman-Brown formula.

b hs = hundredths of seconds.

c m √m pixels = mean of the square root of the mean distance from
  target, computed across all trials.
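
The reliability estimate described in footnote a can be sketched in Python as follows. This is a minimal illustration, assuming each subject's item scores are available as a list in administration order; the function names are ours, and the report does not say whether half-test scores were sums or means (sums are used here, which gives the same correlation when no items are missing).

```python
from math import sqrt
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def spearman_brown(r_half: float, length_factor: float = 2.0) -> float:
    """Project a part-test correlation to a test `length_factor` times as long."""
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


def split_half_reliability(item_scores: Sequence[Sequence[float]]) -> float:
    """Odd-even split-half reliability, corrected to full test length.

    item_scores holds one list of item scores per subject.
    """
    odd_half = [sum(row[0::2]) for row in item_scores]    # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_scores]   # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_half, even_half))


# A half-test correlation of .50 corresponds to a full-length reliability of 2/3.
print(round(spearman_brown(0.50), 3))
```

The correction matters most for short tests: the 10-item Simple Reaction Time scores, for example, are based on halves of only five items each.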
One concern we had prior to the Fort Lewis pilot test was the extent to
which computer measure scores would be affected by differences between
testing stations (a testing station is one Compaq computer and the associated
response pedestal; six such testing stations were used at Fort Lewis).
Recall that we mentioned in Section 1 that differences across testing
apparatus and unreliability of testing apparatus had been a problem in World
War II psychomotor testing and thereafter. The recent advent of microprocessor
technology was viewed as alleviating such problems, at least to some degree.

We ran some analyses of variance to provide an initial look at the
extent of this problem with our testing stations. Thirteen one-way ANOVAs
were run with testing stations as levels and computer test scores as the
dependent variables. We ran separate ANOVAs for white males and non-white
males in order to avoid confounding the results with possible subgroup
differences. Also, only five testing stations were used since one station did
not have enough subjects assigned to it.

Of the 26 ANOVAs, only 1 reached significance at the .05 level, about what
would be expected by chance. These results were heartening. One reason for
these results was the use of calibration software, which adjusted for the
idiosyncratic differences of each response pedestal, ensuring a more
standardized test administration across testing stations.
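
The "expected by chance" statement can be checked with a short calculation, sketched here in Python. It assumes, for simplicity, that the 26 tests are independent and that no true station effect exists; because the test scores are intercorrelated, the independence assumption holds only approximately.

```python
def expected_false_positives(n_tests: int, alpha: float) -> float:
    """Expected number of significant results when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one(n_tests: int, alpha: float) -> float:
    """Chance of one or more significant results among independent null tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


N_ANOVAS = 26   # 13 computer test scores x 2 subgroups
ALPHA = 0.05

print(expected_false_positives(N_ANOVAS, ALPHA))
print(round(prob_at_least_one(N_ANOVAS, ALPHA), 2))
```

Under these assumptions about 1.3 significant results would be expected, so a single significant ANOVA gives no indication of a station effect.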

The results of the Fort Lewis pilot test of the computer-administered
measures in the Pilot Trial Battery were extremely useful. The results
showed very high promise for these measures. In addition, the soldiers liked
the test battery. Virtually every soldier expressed a preference for the
computerized tests compared to the paper-and-pencil tests. We thought there
were several reasons for this: novelty; the game-like nature of several
tests; and the fact that the battery was, in large part, self-paced, allowing
each soldier to thoroughly understand the instructions and to work through
the battery at his or her own speed.

Field testing of the computer-administered measures is described in
Section 6.


Section  5 


DEVELOPMENT  OF  NON-COGNITIVE  MEASURES 


This section describes the non-cognitive measures developed for the
Pilot Trial Battery. All are paper-and-pencil measures, and the inventories
are intended to assess constructs in the temperament, interests, and life
history (biodata) domains.

The discussion of these scales is organized around the two inventories
that were developed for Project A. The ABLE (Assessment of Background and
Life Experiences) contains items that assess the important constructs of the
temperament and life history (biodata) domains. The AVOICE (Army Vocational
Interest Career Examination) measures relevant constructs pertaining to
vocational interests.

Recall from Part II, Section 1, that the non-cognitive domain of
selection information was defined and specified by a three-part strategy.
First, a comprehensive literature review was used to generate an exhaustive
list of potential non-cognitive indicators of relevant individual
differences. On the basis of professional staff judgments, the list was
reduced to a non-redundant list of variables. The existing validity evidence
was then summarized around this list of variables. A brief summary of those
results is presented in Tables 11.21, 11.22, and 11.23.

The second part of the strategy was to include the temperament and
biographical variables in the expert judgment forecasts of validity
coefficients described in Section 1. The predicted profiles of validity
coefficients for each temperament and biographical variable were then
intercorrelated and clustered to generate a kind of higher order construct.

The third part of the strategy consisted of examining the empirical
covariation matrix generated by the temperament and biographical measures
that were included in the Preliminary Battery (i.e., the "off-the-shelf"
measures administered to the samples of new recruits in the four MOS in the
Preliminary Battery sample). Factor analyses of these data provided
additional guidance for how best to define the non-cognitive criterion space
in Project A.

All three sources of information were discussed at some length in a
series of meetings attended by the relevant project staff and members of the
Scientific Advisory Group. The result of these deliberations was an array of
constructs that were judged to be the best potential sources of valid
selection/classification information of a non-cognitive nature. The linkages
among the initial variable array, the constructs chosen for measurement, the
variables proposed to reflect them, and the forecasted predictor/criterion
correlations are shown in Figure 11.17 (Hough, 1984).
Table 11.21

Summary of Criterion-Related Validities for Interest Inventories

Criterion                           Median r

Training                            (illegible)
Job Proficiency                     .27
Job Involvement                     .30


Table 11.22

Summary of Criterion-Related Validities for Biographical Inventories

Criterion                           Median r

Training                            .24
Job Proficiency                     .32
Job Involvement                     .29
Unfavorable Military Discharge      .27
Substance Abuse                     .26
Delinquency                         .20
Table 11.23

Summary of Criterion-Related Validities of Temperament Constructs


[Figure 11.17. Linkages among Preliminary Battery variables, literature
review constructs, variables initially proposed for the Pilot Trial Battery,
and experts' predicted correlations with the best appropriate cluster of
Army criteria.]

Note: From Hough (1984)
Description of ABLE Constructs/Scales

Following the identification of the construct array, item writing groups
were created and items were written, revised, edited, and arranged into
specific temperament and biographical scales that were intended to be valid
measures of the chosen constructs. After this initial phase of item writing,
revision, and scale creation, 11 substantive scales and four response bias
scales were produced. Table 11.24 lists the seven constructs initially
chosen for measurement via the ABLE, the 11 scales subsequently developed to
represent them, and four validity scales developed by Project A.


Table 11.24

Temperament/Biodata Scales (by Construct) Developed for Pilot Trial Battery:
ABLE - Assessment of Background and Life Experiences


Construct                       Scale

Adjustment                      Emotional Stability

Dependability                   Nondelinquency
                                Traditional Values
                                Conscientiousness

Achievement                     Work Orientation
                                Self-Esteem

Physical Condition              Physical Condition

Leadership (Potency)            Dominance
                                Energy Level

Locus of Control                Internal Control

Agreeableness/Likeability       Cooperativeness

Response Validity Scales        Non-Random Response
                                Unlikely Virtues (Social Desirability)
                                Poor Impression
                                Self-Knowledge


We  now  discuss,  in  turn,  each  construct  and  the  scales  developed  to 
measure  that  construct.  The  description  of  the  number  of  items  on  each  scale 
refers  to  the  Fort  Campbell  pilot  test  version. 
Adjustment

Adjustment is defined as the amount of emotional stability and stress
tolerance that one possesses. The well-adjusted person is generally calm,
displays an even mood, and is not overly distraught by stressful situations.
He or she thinks clearly and maintains composure and rationality in
situations of actual or perceived stress. The poorly adjusted person is
nervous, moody, and easily irritated, tends to worry a lot, and "goes to
pieces" in times of stress.

The scale included under the Adjustment construct is called Emotional
Stability. Emotional Stability is a 31-item scale that contains items such
as "Have you ever felt sick to your stomach when you thought about something
you had to do?" and "Do you handle pressure better than most other people?"
The scale is designed to assess a person's characteristic affect and ability
to react to stress.


Dependability 

The Dependability construct refers to a person's characteristic degree
of conscientiousness. The dependable person is disciplined, well-organized,
planful, respectful of laws and regulations, honest, trustworthy, wholesome,
and accepting of authority. Such a person prefers order and thinks before
acting. The less dependable person is unreliable, acts on the spur of the
moment, and is rebellious and contemptuous of laws and regulations. Three
ABLE scales fall under the Dependability construct--Nondelinquency,
Traditional Values, and Conscientiousness.

Nondelinquency is a 24-item scale that assesses how often a person has
violated rules, laws, or social norms. It includes items such as "How often
have you gotten into fights?", "Before joining the Army, how hard did you
think learning to take orders would be?", and "How many times were you
suspended or expelled from high school?"

Traditional Values, a 19-item scale, contains items such as "Are you
more strict about right and wrong than most people your age?" and "People
should have greater respect for authority. Do you agree?" These items assess
how conventional or strict a person's value system is, and how much
flexibility he/she has in this value system.

Conscientiousness, the third Dependability scale, includes 24 items.
This scale assesses the respondent's degree of dependability, as well as the
tendency to be organized and planful. Items include: "How often do you keep
the promises you make?", "How often do you act on the spur of the moment?",
and "Are you more neat and orderly than most people?"


Achievement 

The Achievement construct is defined as the tendency to strive for
competence in one's work. The achievement/work-oriented person works hard,
sets high standards, tries to do a good job, endorses the work ethic, and
concentrates on and persists in completion of the task at hand. This person
is also confident, feels success from past undertakings, and expects to
succeed in the future. The person who is less achievement-oriented has
little ego involvement in his or her work, feels incapable and self-doubting,
does not expend undue effort, and does not feel that hard work is desirable.

Two scales fall under the Achievement construct, including a 31-item
scale entitled Work Orientation. This scale addresses how long, hard, and
well the respondent typically works and also how he or she feels about work.
Among the scale items are these: "How hard were you willing to work for good
grades in high school?" and "How important is your work to you?"

The other scale pertaining to Achievement is called Self-Esteem, a
16-item scale that measures how much a person believes in him/herself and how
successful he or she expects to be in life. Items from this scale include:
"Do you believe you have a lot to offer the Army?" and "Has your life so far
been pretty much a failure?"


Physical  Condition 


The optimal way to establish physical condition is, of course, to
administer physical conditioning tests. However, since such a program was
not a part of the trial battery, it was decided to ask self-report questions
pertaining to perceived physical fitness levels.

The Physical Condition construct refers to one's frequency and degree of
participation in sports, exercise, and physical activity.

The scale developed to tap this construct includes 14 items that measure
how vigorously, regularly, and well the respondent engages in physical
activity. Sample items are "Prior to joining the Army, how did your physical
activity (work and recreation) compare to most people your age?" and "Before
joining the Army, how would you have rated your performance in physical
activities?"


Leadership  (Potency) 

This construct is defined as the degree of impact, influence, and energy
that one displays. The person high on this characteristic is appropriately
forceful and persuasive, is optimistic and vital, and has the energy to get
things done. The person low on this characteristic is timid about offering
opinions or providing direction and is likely to be lethargic and
pessimistic.

Two ABLE scales, Dominance and Energy Level, are associated with the
leadership construct. Dominance is a 17-item scale that includes such items
as "How confident are you when you tell others what to do?" and "How often do
people turn to you when decisions have to be made?" The scale assesses the
respondent's tendency to take charge or to assume a central and public role.

The Energy Level scale is designed to measure to what degree one is
energetic, alert, and enthusiastic. This scale includes 21 items, such as
these: "Do you get tired pretty easily?", "At what speed do you like to
work?", and "Do you enjoy just about everything you do?"
Locus of Control


Locus of Control refers to one's characteristic belief in the amount of
control he or she has or people have over rewards and punishments. The
person with an internal locus of control expects that there are consequences
associated with behavior and that people control what happens to them by what
they do. Persons with an external locus of control believe that what happens
is beyond their personal control.

The ABLE Internal Control scale is a 21-item scale that assesses both
internal and external control, primarily as they pertain to reaching success
on the job and in life. The following are example items: "Getting a raise
or a promotion is usually a matter of luck. Do you agree?" and "Do you
believe you can get most of the things you want if you work hard enough for
them?"


Agreeableness/Likeability

The Agreeableness/Likeability construct is defined as the degree of
pleasantness versus unpleasantness exhibited in interpersonal relations. The
agreeable and likeable person is pleasant, tolerant, tactful, helpful, not
defensive, and generally easy to get along with. His or her participation in
a group adds cohesiveness rather than friction. The relatively disagreeable
and unlikeable person is critical, fault-finding, touchy, defensive,
alienated, and generally contrary.

The ABLE Cooperativeness scale is composed of 28 items intended to
assess how easy it is to get along with the person making the responses.
Items include: "How often do you lose your temper?", "Would most people
describe you as pleasant?", and "How well do you accept criticism?"


Validity  Scales 

The primary purpose of these scales is to determine the validity of
responses, that is, the degree to which the responses are accurate depictions
of the person completing the inventory. Four validity scales are included:
Non-Random Response, Unlikely Virtues, Poor Impression, and Self-Knowledge.

Non-Random Response. The response options for this scale are composed
of one right answer, scored as one, and two response options that are both
wrong and are both scored zero. The content asks about information that any
person is virtually certain to know. Two of the eight items from the
Non-Random Response scale are:

"The branch of the military that deals most with airplanes is the:

1. Military Police

2. Coast Guard

3. Air Force"

"Groups of soldiers are called:

1. Tribes

2. Troops

3. Weapons"

The  intent  of  this  scale  is  to  detect  those  respondents  who  cannot  or 
are  not  reading  the  questions,  and  are  instead  randomly  filling  in  the 
circles  on  the  answer  sheet. 
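
The scoring rule for this scale can be sketched in Python as follows. Only the two items quoted above are known from the text, so the answer key below covers just those two; a full key for all eight items would be needed in practice, and the function name is our own.

```python
from typing import Optional, Sequence

# Keyed (correct) option for the two quoted items only:
# item 1 -> option 3 (Air Force), item 2 -> option 2 (Troops).
EXAMPLE_KEY = [3, 2]


def non_random_response_score(responses: Sequence[Optional[int]],
                              key: Sequence[int]) -> int:
    """Count items answered with the keyed option.

    The right answer scores one; either wrong option, or an omitted
    response (None), scores zero.
    """
    return sum(1 for given, correct in zip(responses, key) if given == correct)


# A respondent who reads the items should choose the keyed options.
print(non_random_response_score([3, 2], EXAMPLE_KEY))
# A respondent marking the first option on every item would miss both.
print(non_random_response_score([1, 1], EXAMPLE_KEY))
```

Because every item has one obviously correct option among three, a respondent who marks the answer sheet at random would be expected to answer only about one-third of the items correctly.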

Unlikely Virtues. This scale is aimed at detecting those who respond in
a socially desirable manner (i.e., "fake good"). There are 12 items, such
as: "Do you sometimes wish you had more money?" or "Have you always helped
people without even the slightest bit of hesitation?"

Poor Impression. This scale does not reflect psychopathology but rather
an attempt to simulate psychopathology. Persons who attempt to "fake bad"
receive the most deviant scores, while psychiatric patients score average or
slightly higher than average. Thus, this scale is designed to detect those
respondents who wish to make themselves appear emotionally unstable when in
fact they are not.

The Poor Impression scale has 23 items, most of which are also scored on
another substantive ABLE scale. Items include "How much resentment do you
feel when you don't get your way?", "Did your high school classmates consider
you easy to get along with?", and "How often do you keep the promises that
you make?" The response option scored as 1 is the option that indicates the
least social desirability.

Self-Knowledge. This 13-item scale is intended to identify people who
are more self-aware, more insightful, and more likely to have accurate
perceptions about themselves. The responses of persons high on this scale may
have more validity for predicting job criteria. The following are items from
the Self-Knowledge scale: "Do other people know you better than you know
yourself?" and "How often do you think about who you are?"


ABLE  Revisions  Based  on  Pilot  Test  Results 

The non-cognitive inventories were pilot tested at two of the three
pilot test sites. Revision of the ABLE took place in three steps. The first
was editorial revision prior to pilot testing, the second was based on Fort
Campbell results, and the third was based on Fort Lewis findings. The
editorial changes prior to pilot testing were made by the research staff acting
on suggestions from both sponsor and contractor reviews of the instruments.

The  changes  resulting  from  the  first  editorial  review  consisted  of  the 
deletion  of  17  items  and  the  revision  of  158  items.  The  revisions  largely 
consisted  of  minor  changes  in  wording,  resulting  in  more  consistency  across 
items  in  format,  phrasing,  and  response  options. 
Fort Campbell Pilot Test


When the inventory was administered at Fort Campbell, the respondents
raised very few criticisms or concerns about the ABLE. Several subjects did
note the redundancy of the items on the Physical Condition scale, and this
14-item scale was shortened to 9 items.

Item analyses were based on the data from 52 of the 56 Fort Campbell
subjects. The four excluded were two who had more than 10% missing data and
two who answered fewer than seven of the eight Non-Random Response scale
items "correctly." The two statistics that were examined for each ABLE item
were its correlation with the total scale on which it is scored and the
endorsement frequencies for all of its response options. Items that failed to
correlate at least .15 in the appropriate direction with their respective
scales were considered potentially weak. Items (other than validity scale
items) for which one or more of the response options was endorsed by fewer
than two subjects (i.e., <4% of the sample) were also identified. Six items
fell into the former category and 68 items fell into the latter, and an
additional 7 items fell into both. Many of these 81 items had already been
revised or deleted during the editorial process. However, all of them were
examined for revision and deletion, as appropriate. As a result, 15 items
were revised for the first time, 18 items were further revised, and 6
additional items were deleted.
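
The two screening statistics can be sketched in Python as follows. This is an illustration under stated assumptions rather than the analysis actually run: item scores are assumed to be already keyed in the scale's direction, the total includes the item being examined (the text does not say whether a corrected item-total correlation was used), and each item is assumed to have three response options coded 1 to 3.

```python
from collections import Counter
from math import sqrt
from typing import List, Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Product-moment correlation; returns 0.0 when either variable is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / sqrt(sxx * syy)


def weak_items(keyed_scores: List[List[float]], min_r: float = 0.15) -> List[int]:
    """Indexes of items correlating below min_r with their scale total."""
    totals = [sum(row) for row in keyed_scores]
    flagged = []
    for j in range(len(keyed_scores[0])):
        item = [row[j] for row in keyed_scores]
        if pearson_r(item, totals) < min_r:
            flagged.append(j)
    return flagged


def rarely_endorsed_items(choices: List[List[int]],
                          options: Sequence[int] = (1, 2, 3),
                          min_count: int = 2) -> List[int]:
    """Indexes of items with any response option chosen by fewer than min_count subjects."""
    flagged = []
    for j in range(len(choices[0])):
        counts = Counter(row[j] for row in choices)
        if any(counts[option] < min_count for option in options):
            flagged.append(j)
    return flagged


# Hypothetical responses: one row per subject, one column per item.
choices = [[1, 1], [2, 1], [3, 1], [1, 2], [2, 2], [3, 3]]
print(rarely_endorsed_items(choices))
```

An item flagged by either function would then be reviewed for revision or deletion, as described above; an item flagged by both corresponds to the "fell into both" category.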

In summary, a total of 23 items were deleted and 173 items revised on
the basis of the editorial review and Fort Campbell findings. Those items
that were deleted were those that did not "fit well" either conceptually or
statistically, or both, with the other items in the scale and with the
construct in question. If the item appeared to have a "good fit" but was not
clear or did not elicit sufficient variance, it was revised rather than
deleted. The ABLE, which had begun at 291 items, was now a revised 268-item
inventory to be administered at Fort Lewis.


Fort  Lewis  Pilot  Test 


The ABLE was completed by 118 soldiers during the pilot testing at Fort
Lewis. One of the 118 inventories was deleted from analysis because data
were missing for more than 10% of the items, and another 11 inventories were
deleted because fewer than seven of the eight Non-Random Response scale items
were answered "correctly." Thus, the remaining sample size was 106.
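
The respondent-level exclusion rule (more than 10% missing data, or fewer than seven of the eight Non-Random Response items answered correctly) can be written as a small filter. The sketch below is hypothetical: the function name, the use of `None` for a missing answer, and the argument layout are assumptions made for illustration.

```python
def keep_respondent(answers, nrr_correct, max_missing=0.10, min_nrr=7):
    """Return True if an inventory passes both screening rules.

    answers: list of item responses, with None marking a missing answer.
    nrr_correct: number of Non-Random Response items answered "correctly."
    """
    missing_rate = sum(a is None for a in answers) / len(answers)
    # Excluded only when MORE than 10% of items are missing
    return missing_rate <= max_missing and nrr_correct >= min_nrr

# Accounting for the Fort Lewis sample as reported in the text:
# 118 completed - 1 (missing data) - 11 (Non-Random Response) = 106
fort_lewis_n = 118 - 1 - 11
```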

Item response frequency distributions were examined to detect items with
relatively little discriminatory power. There were only three items where
two of the three response choices were endorsed by less than 10% of the
sample (not including validity scale items). After examining the content of
these three items, it was decided to leave two of them intact, and delete
one. Twenty items were revised because one of the three response choices was
endorsed by less than 10% of the sample.
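
The endorsement check described above amounts to counting how many response options fall below a 10% share. A minimal sketch, with a hypothetical function name and with missing answers assumed to be coded as `None` and ignored:

```python
def rare_options(item_responses, options=(1, 2, 3), min_share=0.10):
    """Return the response options endorsed by less than min_share of
    the respondents who answered the item."""
    answered = [r for r in item_responses if r is not None]
    return [o for o in options
            if answered.count(o) / len(answered) < min_share]
```

For a three-option ABLE item, two rare options would mark the item for content review and one rare option would mark it as a candidate for revision. The same function applies to a five-option item by passing `options=(1, 2, 3, 4, 5)` and counting the returned options.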

Overall, the inventory appeared to be functioning well and only minor
revisions were required. On the following pages, the psychometric data
obtained during the two pilot tests are presented.


Scale  Statistics  and  Intercorrelations 


Table II.25 presents means, standard deviations, mean item-total
correlations, and Hoyt internal consistency reliabilities for each ABLE scale
in each of the two pilot samples. In Table II.26 the scale intercorrelations
are shown, except for the Non-Random Response and Poor Impression validity
scales. It is interesting to note the low correlations between the Unlikely
Virtues scale, which is an indicator of social desirability, and the other
scales. This finding, although based on small samples, suggests that
soldiers were not responding only in a socially desirable fashion, but were
instead responding honestly.
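
For readers unfamiliar with the Hoyt reliability, it is an analysis-of-variance estimate of internal consistency: the person-by-item score matrix is decomposed into person, item, and residual sums of squares, and the reliability is one minus the ratio of the residual mean square to the person mean square. It is algebraically equal to coefficient alpha. The sketch below illustrates the calculation; the function name is hypothetical, and complete data (no missing responses) are assumed.

```python
def hoyt_reliability(scores):
    """Hoyt (ANOVA) internal consistency reliability.

    scores: list of per-person lists of item scores, complete data,
    with at least two items and some variation between persons.
    """
    n, k = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n * k)
    person_means = [sum(row) / k for row in scores]
    item_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_persons = k * sum((m - grand) ** 2 for m in person_means)
    ss_items = n * sum((m - grand) ** 2 for m in item_means)
    ss_resid = ss_total - ss_persons - ss_items

    ms_persons = ss_persons / (n - 1)
    ms_resid = ss_resid / ((n - 1) * (k - 1))
    return 1 - ms_resid / ms_persons
```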

Table II.25

ABLE Scale Statistics for Fort Campbell and Fort Lewis Pilot Samples(a)

                                                                         Mean
                                                                   Item-Total          Hoyt
                              No. Items        Mean          SD   Correlation   Reliability
ABLE Scale                      A     B     A     B     A     B      A      B      A      B

ADJUSTMENT
  Emotional Stability          31    30  72.0  69.0   9.1   8.6    .47    .46    .87    .87

DEPENDABILITY
  Nondelinquency               24    25  55.9  59.1   6.3   6.3    .40    .40    .80    .78
  Traditional Values           19    16  43.8  32.4   4.8   4.3    .39    .41    .73    .67
  Conscientiousness            24    21  68.0  50.2   5.8   5.3    .41    .41    .80    .75

ACHIEVEMENT
  Work Orientation             31    27  74.5  62.9   8.0   7.8    .42    .48    .84    .84
  Self-Esteem                 [?]    15  37.4  34.9   5.0   4.7    .54    .52    [?]    .80

LEADERSHIP (POTENCY)
  Dominance                    17    16  37.7  36.6   5.0   6.1    .53    .57    .78    .86
  Energy Level                 27    25  61.3  59.3   7.2   7.4    .46    .52    .85    .88

LOCUS OF CONTROL
  Internal Control             21    21  51.0  49.9   6.3   6.3    .46    .46    .84    .80

AGREEABLENESS/LIKEABILITY
  Cooperativeness              28    25  63.8  56.4   7.0   6.7    .39    .43    .82    .81

PHYSICAL CONDITION
  Physical Condition           14     9  43.1  31.3   9.7   7.0    .66    .73    .92    .87

ABLE VALIDITY SCALES
  Non-Random Response           8     8    --   7.5    --    .7    [?]    .43    [?]    [?]
  Unlikely Virtues             12    12  18.0  16.6   3.2   3.5    .38    .48    .37    .71
  Self-Knowledge               13    13  31.4  29.8   3.7   4.0    .43    .46    .61    .71

(a) Column A indicates Fort Campbell (N = 52) and Column B Fort Lewis (N = 106).

[?] = value not recoverable from the source copy. The source also shows a
reliability value of .86 between the Work Orientation and Self-Esteem rows,
and the column placement of the .43 value in the Non-Random Response row is
uncertain.


Table II.26 (ABLE scale intercorrelations): table body not recoverable from
the source copy.

NOTE: Decimals have been omitted


In addition to the ABLE, four well-established measures of temperament
had been administered to 46 Fort Campbell soldiers to serve as marker
variables. They were the Socialization scale of the California Psychological
Inventory, Rotter's Locus of Control scale, and the Stress Reaction scale and
Social Potency scale of the Differential Personality Questionnaire. The four
scales had also been used earlier as part of the Preliminary Battery
temperament inventory, known as the Personal Opinion Inventory (POI).

From a total of 46 soldiers completing the instruments, the responses of
38 were used to compute correlations between ABLE scales and the markers.
Results are shown in Table II.27. While these results are based on a small
sample, they do indicate that the ABLE scales appear to be measuring the
constructs they were intended to measure.


TABLE II.27

Correlations Between ABLE Constructs and Scales and Personal Opinion
Inventory (POI) Marker Variables: Fort Campbell Pilot Test


                                              POI Scale

                        DPQ Stress    DPQ Social    Rotter Locus    CPI
ABLE Construct          Reaction      Potency       of Control      Socialization

Emotional Stability       -.70           .32            .30             .32
Dominance                 -.24           .67            .18             .22
Internal Control          -.32           .26            .67             .60
Nondelinquency            -.34           .10            .32             .62


Using the Fort Lewis data (N = 106), the correlation matrices for the 10
ABLE substantive scales were factor analyzed, both with and without the
social desirability variance. Principal factor analyses were used, with
rotation to simple structure by varimax rotation. Both factor matrices
appear in Table II.28.
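
Removing social desirability variance before factoring is conventionally done by replacing each correlation with a first-order partial correlation that holds the social desirability measure constant. The text does not spell out the procedure, so the sketch below is an assumption: it treats a single scale (for example, Unlikely Virtues) as the control variable, and the function name is hypothetical.

```python
from math import sqrt

def partial_out(r, z):
    """Partial correlation matrix with variable z held constant.

    r: full correlation matrix (list of lists) that includes variable z.
    z: index of the control variable (assumed here to be the social
       desirability scale).
    Returns the matrix of partial correlations among the other variables.
    """
    keep = [i for i in range(len(r)) if i != z]
    out = []
    for i in keep:
        row = []
        for j in keep:
            if i == j:
                row.append(1.0)
            else:
                # r_ij.z = (r_ij - r_iz * r_jz) / sqrt((1 - r_iz^2)(1 - r_jz^2))
                num = r[i][j] - r[i][z] * r[j][z]
                den = sqrt((1 - r[i][z] ** 2) * (1 - r[j][z] ** 2))
                row.append(num / den)
        out.append(row)
    return out
```

The resulting matrix would then be submitted to the same principal factor analysis and varimax rotation as the unadjusted matrix.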

The structure of the temperament and biodata domain, as measured by the
ABLE during the pilot tests, could not be specified with certainty due to the
relatively small pilot test sample upon which the correlational and factor
analyses were run. The larger Concurrent Validation samples will provide
more definitive information. The scales do, however, appear to be measuring
the same content as the marker variables that were a part of the Personal
Opinion Inventory (Preliminary Battery). The internal consistency
reliabilities and score distributions of the ABLE scales are more than adequate.


Description  of  Interest  (AVOICE)  Constructs/Scales 

The seminal work of John Holland (1966) has resulted in widespread
acceptance of a six-construct, hexagonal model of interests. Our principal
problem in the development and testing of an interests measure was not which
constructs to measure, but rather how much emphasis should be devoted to the
assessment of each.

The interest inventory used in the Preliminary Battery is called the
VOICE (Vocational Interest Career Examination), and was originally developed
by the U.S. Air Force. This inventory served as the starting point for the
AVOICE (Army Vocational Interest Career Examination). The intent for the
AVOICE was to measure all six of Holland's constructs, as well as provide
sufficient coverage of the vocational areas most important in the Army.
Table II.29 shows the six interest constructs assessed by the AVOICE together
with their associated scales. The Basic Interest Item, one of which is
written for each Holland construct, describes a person with prototypic
interests. The respondent indicates how well this description fits him/her.

In addition to the Holland constructs and associated scales, the AVOICE
also included six scales dealing with organizational climate and environment
and an expressed interests scale. Table II.30 shows these variables and
associated measures.

As used in the pilot testing, the AVOICE included 306 items. Nearly all
items were scored on a five-point scale that ranged from "Like Very Much"
(scored 5) to "Dislike Very Much" (scored 1). Items in the Expressed
Interests scale were scored on a three-point scale in which the response
options were different for each item, yet one option always reflected the
most interest, one moderate interest, and one the least interest.

Each construct/category and the scales developed for it are now
discussed in turn.


TABLE II.28

Varimax Rotated Principal Factor Analyses of 10 ABLE Scales:
Fort Lewis Pilot Test


Five-Factor Solution With Social Desirability Variance Included

ABLE Scale                  I     II    III     IV      V

Dominance                 [*]    .15    .16    .00    .21
Energy Level              .45    .19    .32    .22    [*]
Self-Esteem               [*]    .13    .22    .30    .27
Internal Control          .33    [*]    .15    .44    .29
Traditional Values        .18    [*]    .29    .22    .10
Nondelinquency            .09    .50    [*]    .41    .09
Conscientiousness         .40    .34    [*]    .14    .16
Work Orientation          .57    .25    [*]    .15    .24
Emotional Stability       .33    .11    .02    .43    [*]
Cooperativeness           .08    .30    .21    [*]    .22


Five-Factor Solution With Social Desirability Variance Partialed Out

ABLE Scale                  I     II    III     IV      V

Dominance                 [*]    .15    .23   -.03    .18
Energy Level              .39    .18    [*]    .13    .36
Self-Esteem               [*]    .12    .32    .24    .19
Internal Control          .31    [*]    .34    .40    .14
Traditional Values        .17    [*]    .10    .17    .18
Nondelinquency            .08  [.56]    .06    .40    .42
Conscientiousness         .40    .37    .11    .11    [*]
Work Orientation          .57    .27    .22    .13    [*]
Emotional Stability       .30    .08    [*]    .35    .06
Cooperativeness           .06    .31    .26    [*]    .13

[ ] = loading boxed in the source; [*] = boxed loading whose value is
illegible in the source copy.


Table II.29

Holland Basic Interest Constructs, and Army Vocational Interest Career
Examination (AVOICE) Scales Developed for Pilot Trial Battery


Construct               Scale

Realistic               Basic Interest Item
                        Mechanics
                        Heavy Construction
                        Electronics
                        Electronic Communication
                        Drafting
                        Law Enforcement
                        Audiographics
                        Agriculture
                        Outdoors
                        Marksman
                        Infantry
                        Armor/Cannon
                        Vehicle Operator
                        Adventure

Conventional            Basic Interest Item
                        Office Administration
                        Supply Administration
                        Food Service

Social                  Basic Interest Item
                        Teaching/Counseling

Investigative           Basic Interest Item
                        Medical Services
                        Mathematics
                        Science/Chemical
                        Automated Data Processing

Enterprising            Basic Interest Item
                        Leadership

Artistic                Basic Interest Item
                        Aesthetics

Table II.30

Additional AVOICE Measures: Organizational Climate/Environment
and Expressed Interest Scales


Construct                                   Scale

Achievement (Org. Climate/Environment)      Achievement
                                            Authority
                                            Ability Utilization

Safety (Org. Climate/Environment)           Organizational Policies and
                                              Procedures
                                            Supervision - Human Resources
                                            Supervision - Technical

Comfort (Org. Climate/Environment)          Activity
                                            Variety
                                            Compensation
                                            Security
                                            Working Conditions

Status (Org. Climate/Environment)           Advancement
                                            Recognition
                                            Social Status

Altruism (Org. Climate/Environment)         Co-Workers
                                            Moral Values
                                            Social Services

Autonomy (Org. Climate/Environment)         Responsibility
                                            Creativity
                                            Independence

Expressed Interests                         Expressed Interests


Realistic Interests

This construct is defined as a preference for concrete and tangible
activities, characteristics, and tasks. Persons with realistic interests
enjoy and are skilled in the manipulation of tools, machines, and animals,
but find social and educational activities and situations aversive.
Realistic interests are associated with occupations such as mechanic,
engineer, and wildlife conservation officer, and negatively associated with
such occupations as social worker and artist.

The Realistic construct is by far the most thoroughly assessed of the
six constructs tapped by the AVOICE, reflecting that the preponderance of
work in the Army is of a Realistic nature. Fourteen AVOICE scales are
included, in addition to the Basic Interest Item.

Conventional  Interests 

Conventional interests refer to one's degree of preference for
well-ordered, systematic, and practical activities and tasks. Persons with
conventional interests may be characterized as conforming, not overly
imaginative, efficient, and calm. Conventional interests are associated with
occupations such as accountant, clerk, and statistician, and negatively
associated with occupations such as artist or author.

In addition to the Basic Interest Item, three scales fall under the
Conventional Interests construct: Office Administration, Supply
Administration, and Food Service. They have, respectively, 16, 13, and 17
items.

Social  Interests 

Social interests are defined as the amount of liking one has for social,
helping, and teaching activities and tasks. Persons with social interests
may be characterized as responsible, idealistic, and humanistic. Social
interests are associated with occupations such as social worker, high school
teacher, and speech therapist, and negatively associated with occupations
such as mechanic or carpenter.

Besides the Basic Interest Item, only one scale is included in the
AVOICE for assessing Social interests, the Teaching/Counseling scale. This
7-item scale includes items such as "Give on-the-job training," "Organize
and lead a study group," and "Listen to people's problems and try to help
them."


Investigative  Interests 

This construct refers to one's preference for scholarly, intellectual,
and scientific activities and tasks. Persons with investigative interests
enjoy analytical, ambiguous, and independent tasks, but dislike leadership
and persuasive activities. Investigative interests are associated with such
occupations as astronomer, biologist, and mathematician, and negatively
associated with occupations such as salesman or politician.

Along with the Basic Interest Item, Medical Services, Mathematics,
Science/Chemical, and Automated Data Processing are the four AVOICE scales
that tap Investigative interests. The scales differ in length, with Medical
Services containing 24 items; Mathematics, 5; Science/Chemical, 11; and
Automated Data Processing, 7.


Enterprising Interests


The Enterprising construct refers to one's preference for persuasive,
assertive, and leadership activities and tasks. Persons with enterprising
interests may be characterized as ambitious, dominant, sociable, and
self-confident. Enterprising interests are associated with such occupations
as salesperson and business executive, and negatively associated with
occupations such as biologist or chemist.

Besides the Basic Interest Item, only one AVOICE scale assesses the
respondent's Enterprising interests. This scale, entitled Leadership,
contains six items.


Artistic  Interests 

This final Holland construct is defined as a person's degree of liking
for unstructured, expressive, and ambiguous activities and tasks. Persons
with artistic interests may be characterized as intuitive, impulsive,
creative, and non-conforming. Artistic interests are associated with such
occupations as writer, artist, and composer, and negatively associated with
occupations such as accountant or secretary.

In addition to the Basic Interest Item, the AVOICE Aesthetics scale is
designed to tap Artistic interests, and includes five items.


Organizational Climate/Environment

Six constructs that pertain to a person's preference for certain types
of work environments and conditions are assessed by the AVOICE through 20
scales of two items each. These environmental constructs include
Achievement, Safety, Comfort, Status, Altruism, and Autonomy. The items that
assess these constructs are distributed throughout the AVOICE, and are
responded to in the same manner as the interests items, that is, "Like Very
Much" to "Dislike Very Much."

Because the scales contain only two items each and for ease of
presentation, Figure II.18 shows the constructs, scales, and an item from each scale.


Expressed  Interests 

Although not a psychological construct, expressed interests were
included in the AVOICE because of the extensive research showing their
validity in criterion-related studies. These studies had measured expressed
interests simply by asking respondents what occupation or occupational area
was of most interest to them. In the AVOICE, such an open-ended question was
not feasible, so instead respondents were asked how confident they were that
their chosen job in the Army was the right one for them.


Construct/Scale                   Example

Achievement
  Achievement                     "Do work that gives a feeling of accomplishment."
  Authority                       "Tell others what to do on the job."
  Ability Utilization             "Make full use of your abilities."

Safety
  Organizational Policy           "A job in which the rules are not equal for everyone."
  Supervision - Human Resources   "Have a boss that supports the workers."
  Supervision - Technical         "Learn the job on your own."

Comfort
  Activity                        "Work on a job that keeps a person busy."
  Variety                         "Do something different most days at work."
  Compensation                    "Earn less than others do."
  Security                        "A job with steady employment."
  Working Conditions              "Have a pleasant place to work."

Status
  Advancement                     "Be able to be promoted quickly."
  Recognition                     "Receive awards or compliments on the job."
  Social Status                   "A job that does not stand out from others."

Altruism
  Co-workers                      "A job in which other employees were hard to get to
                                  know."
  Moral Values                    "Have a job that would not bother a person's conscience."
  Social Services                 "Serve others through your work."

Autonomy
  Responsibility                  "Have work decisions made by others."
  Creativity                      "Try out your own ideas on the job."
  Independence                    "Work alone."


Figure II.18. Organizational climate/environment constructs, scales within
constructs, and an item from each scale.


This Expressed Interests scale contained eight items which, as
mentioned, had three response options that formed a continuum of confidence in
the person's occupational choice. Selected items from this scale include:
"Before you went to the recruiter, how certain were you of the job you wanted
in the Army?", "If you had the opportunity right now to change your job in
the Army, would you?", and "Before enlisting, how long were you interested in
a particular Army job?"

AVOICE  Revisions  and  Scale  Statistics  Based  on  Pilot  Tests 

Revisions were made in the AVOICE on the basis of pilot administrations
at Fort Campbell and Fort Lewis. Overall, the revisions made were far less
substantial for the AVOICE than for the ABLE. Editorial review of the
inventory, together with the verbal feedback of Fort Campbell soldiers, resulted
in revision of 15 items, primarily minor wording changes. An additional five
items were modified because of low item correlations with the total scale
score in the Fort Campbell data. No items were deleted based on editorial
review, verbal feedback, or item analyses.

In the Fort Lewis pilot test, no revisions or deletions were made to the
AVOICE items. Item response frequencies were examined to detect items that
had relatively little discriminatory power; that is, three or more of the
five response choices received less than 10% endorsement. There were only
two such items, and, upon examination of the item content, it was decided not
to revise these. Thus, a total of only 20 AVOICE items were revised based on
editorial review and pilot testing.

At Fort Campbell a total of 57 soldiers completed the AVOICE, with 55
providing sufficient data for analysis. For the Fort Lewis data the
responses of 4 of 118 soldiers were eliminated for exceeding the missing data
criterion (10%), resulting in an analysis sample size of 114. Scale
statistics for this larger sample are shown in Table II.31. Reliabilities are
again excellent.

AVOICE scale means and standard deviations were also calculated
separately for males and females and for blacks and whites, but the sample sizes
are very small and these data are best viewed as exploratory only. As would
be expected on the basis of previous research, there are differences between
the sexes in mean score on certain interest scales (Table II.32). Scales
such as Mechanics and Heavy Construction show higher scores for males than
females. On the majority of the scales, however, the differences are less
pronounced. Differences between blacks and whites are quite small and those
tables are not shown.


Table II.31

AVOICE Scale Statistics for Total Group: Fort Lewis Pilot Test (N = 114)


                                                             Mean
                                No.                    Item-Total         Hoyt
AVOICE Scale                  Items     Mean      SD  Correlation  Reliability

REALISTIC
  Basic Interest Item             1     3.09    1.17           --           --
  Mechanics                      16    53.02   13.13          .73          .94
  Heavy Construction             23    72.57   15.64          .62          .92
  Electronics                    20    63.94   16.86          .73          .96
  Electronic Communication        7    21.44    5.73          .73          .85
  Drafting                        7    22.62    6.11          .76          .87
  Law Enforcement                16    50.82   11.33          .63          .89
  Audiographics                   7    24.30    5.12          .69          .81
  Agriculture                     5    15.24    3.62          .61          .58
  Outdoors                        9    33.09    6.25          .62          .80
  Marksman                        5    16.57    4.48          .79          .84
  Infantry                       10    31.04    7.26          .64          .84
  Armor/Cannon                    8    23.46    6.15          .67          .83
  Vehicle Operator               10    30.45    7.10          .65          .84
  Adventure                       8    18.84    3.60          .57          .72

CONVENTIONAL
  Basic Interest Item             1     3.00     .92           --           --
  Office Administration          16    45.39   12.61          .72          .94
  Supply Administration          13    36.97    9.65          .71          .92
  Food Service                   17    43.46   10.53          .59          .89

SOCIAL
  Basic Interest Item             1     3.25    1.03           --           --
  Teaching/Counseling             7    23.61    5.20          .71          .83

INVESTIGATIVE
  Basic Interest Item             1     3.09     .95           --           --
  Medical Services               24    71.32   16.65          .66          .94
  Mathematics                     5    15.82    4.20          .75          .80
  Science/Chemical               11    30.29    8.41          .68          .88
  Automated Data Processing       7    24.29    5.78          .74          .86

ENTERPRISING
  Basic Interest Item             1     3.11    1.13           --           --
  Leadership                      6    20.71    4.41          .72          .81

ARTISTIC
  Basic Interest Item             1     2.99    1.27           --           --
  Aesthetics                      5    14.73    4.12          .74          .79

ORGANIZATIONAL CLIMATE/
ENVIRONMENT DIMENSIONS
  Achievement                     6    21.09    2.95          [?]          [?]
  Safety                          6    21.64    3.20          [?]          [?]
  Comfort                        10    38.50    3.83          [?]          [?]
  Status                          6    21.37    2.97          [?]          [?]
  Altruism                        6    21.67    3.28          [?]          [?]
  Autonomy                        6    20.46    2.33          [?]          [?]

EXPRESSED INTEREST                8    15.71    3.19          .59          .66

[?] = value not recoverable from the source copy.



Table II.32

AVOICE Means and Standard Deviations Separately for Males and Females:
Fort Lewis Pilot Test


                                    Males               Females
                                   (N = 87)             (N = 19)
AVOICE Scale                    Mean       SD        Mean       SD

REALISTIC
  Basic Interest Item           3.24     1.13        2.35     1.11
  Mechanics                    54.93    12.51       44.05    12.28
  Heavy Construction           75.31    13.24       59.70    19.22
  Electronics                  66.38    15.95       52.45    16.23
  Electronic Communication     21.48     5.73       21.25     5.72
  Drafting                     22.97     6.11         [?]     5.83
  Law Enforcement              51.72    11.41       46.60     9.95
  Audiographics                24.27      [?]       24.45     5.52
  Agriculture                  15.46     3.59       14.20     3.57
  Outdoors                     33.94     5.75       29.10     6.92
  Marksman                     17.35      [?]       12.90     4.56
  Infantry                     31.94     7.14       26.85     6.28
  Armor/Cannon                 24.21     5.99       19.95     5.71
  Vehicle Operator             31.05     6.52       27.60     8.31
  Adventure                    19.39      [?]       16.32     3.91

CONVENTIONAL
  Basic Interest Item           2.97      .92        3.15      .91
  Office Administration        44.91    11.93       47.60    15.19
  Supply Administration        36.95     9.56       37.10      [?]
  Food Service                 42.54     9.89       47.80    12.23

SOCIAL
  Basic Interest Item           3.24     1.05        3.30      .95
  Teaching/Counseling          23.15     5.13       25.75     4.97

INVESTIGATIVE
  Basic Interest Item           3.10      .95        3.05      .97
  Medical Services             71.10    16.65       72.40    16.59
  Mathematics                  15.59     4.31       16.95     3.40
  Science/Chemical             30.99     8.69         [?]     5.96
  Automated Data Processing    24.20     5.97       24.70     4.76

ENTERPRISING
  Basic Interest Item           3.14     1.14        2.95     1.02
  Leadership                   20.53     4.61       21.55     3.17

ARTISTIC
  Basic Interest Item           2.96     1.25        3.15     1.31
  Aesthetics                   14.29     4.22       16.80     2.77

ORGANIZATIONAL CLIMATE/
ENVIRONMENT DIMENSIONS
  Achievement                  20.97     2.92       21.65     3.02
  Safety                       21.59     3.36       21.90     2.23
  Comfort                      38.26     3.76       39.65     3.97
  Status                       21.22     3.00       22.05     2.73
  Altruism                     21.48     3.26       22.55     3.26
  Autonomy                     20.45     2.22       20.55     2.78

EXPRESSED INTEREST             15.79     3.34       15.35     2.29

[?] = value not recoverable from the source copy.

Summary of Pilot Test Results for Non-Cognitive Measures


The two non-cognitive inventories of the Pilot Trial Battery, the ABLE
and the AVOICE, are designed to measure a total of 20 constructs plus a
validity scale category. The ABLE assesses six temperament constructs and
the Physical Condition construct through 11 scales, and also includes 4
validity scales. The AVOICE measures six Holland interests constructs, six
Organizational Environment constructs, and Expressed Interests through 31
scales. Altogether, the 46 scales of the inventories include approximately
600 items: 291 ABLE items and 306 AVOICE items for the Fort Campbell
version, and 268 ABLE items and 306 AVOICE items for the Fort Lewis version.


Evaluation and revision of the inventories took place in three steps.
First, each was subjected to editorial review by project staff prior to any
pilot testing. This review resulted in nearly 200 wording changes and the
deletion of 17 items. The majority of these changes applied to the ABLE.


The second stage of evaluation took place after the Fort Campbell pilot
testing. Feedback from the soldiers taking the inventory and data analyses of
the results (e.g., item-total correlations, item response distributions) were
used to refine the inventories. Twenty-three ABLE items were deleted and 173
ABLE items were revised; no AVOICE items were deleted and 20 AVOICE items
were revised.


In the third stage of evaluation, after the Fort Lewis pilot testing,
far fewer changes were made. One ABLE item was deleted, 20 ABLE items were
revised, and no changes were made to the AVOICE. Throughout the evaluation
process, it is likely that the AVOICE was less subject to revision because it
uses a common response format for all items, whereas the response options for
ABLE items differ by item.


The psychometric data obtained with both inventories seemed highly
satisfactory; the scales were shown to be reliable and appeared to be
measuring the constructs intended. Sample sizes in these administrations
were fairly small (Fort Campbell N = 52 and 55 for ABLE and AVOICE,
respectively; Fort Lewis N = 106 and 114, ABLE and AVOICE, respectively), but
results were similar in each sample.


Field testing of the non-cognitive measures is described in Section 6.


Section 6


FIELD TESTS OF THE PILOT TRIAL BATTERY


The  previous  sections  have  described  the  development  of  the  Pilot  Trial 
Battery,  the  results  of  three  pilot  tests  at  Forts  Carson,  Campbell,  and 
Lewis,  and  the  revisions  made  on  the  basis  of  the  pilot  data. 

The final step before the Concurrent Validation was a more systematic
series of field tests of all the predictor measures using larger samples.
Described in this section are the field test samples and procedures, and a
variety of analyses of the field test data. The predictor revisions made on
the basis of these analyses are described in Section 7. The outcome of the
field test/revision process was the final form of the predictor battery
(i.e., the Trial Battery) to be used in the Concurrent Validation.

Field  tests  were  conducted  at  three  sites.  The  sites  and  basic  purposes 
of  the  field  test  at  each  site  are  described  below. 

Fort  Knox  -  The  full  Pilot  Trial  Battery  was  administered  here  to 
evaluate  the  psychometric  characteristics  of  all  the  measures  and  to  analyze 
the  covariance  of  the  measures  with  each  other  and  with  the  ASVAB.  In 
addition,  the  measures  were  re-administered  to  part  of  the  sample  to  provide 
data  for  estimating  the  test-retest  reliability  of  the  measures.  Finally, 
part  of  the  sample  received  practice  on  some  of  the  computer  measures  and 
were  then  retested  to  obtain  an  estimate  of  the  effects  of  practice  on  scores 
on  computer  measures. 

Fort  Bragg  -  The  non-cognltlve  Pilot  Trial  Battery  measures,  Assessment 
of  Background  and  Life  Experiences  (ABLE)  and  Army  Vocational  Interest  Career 
Examination  (AVOICE),  were  administered  to  soldiers  at  Fort  Bragg  under 
several  experimental  conditions  to  estimate  the  extent  to  which  scores  on 
these  Instruments  could  be  altered  or  "faked"  when  persons  are  Instructed  to 
do  so. 

Minneapolis  Military  Entrance  Processing  Station  (MEPS)  -  The  non- 
cognltive  measures  were  administered  to  I  sample  of  recrufts  as  they  were 
being  processed  Into  the  Army,  to  obtain  an  estimate  of  how  persons  In  an 
applicant  setting  might  alter  their  scores. 


Cognitive Paper-and-Pencil and Computer-Administered Field Tests

This subsection describes the field tests of the cognitive
paper-and-pencil tests and the computer-administered tests. These data were
collected at Fort Knox. No data from the Fort Bragg and Minneapolis MEPS
studies were used in these analyses.


Sample  and  Procedure 

Data  collection  was  scheduled  for  4  weeks  at  Fort  Knox.  During  the 
first  2  weeks,  24  soldiers  were  scheduled  each  day,  for  a  projected  total 
sample  size  of  240.  These  soldiers  were  administered  the  entire  Pilot  Trial 
Battery.  Each  group  assembled  at  0800  hours.  The  testing  sessions  included 
two  15-minute  breaks,  and  1  hour  was  allowed  for  lunch. 

Each  soldier  from  the  first  2  weeks'  sample  reported  back  for  a  half  day 
of  additional  testing,  either  in  the  morning  or  the  afternoon,  exactly  2 
weeks after her or his first session. Each individual then completed one-
third of all the paper-and-pencil tests (a retest), and completed either the
computer "practice" session or the entire computer battery (a retest).

In  the  experiment  on  practice  effects,  practice  was  given  on  five 
tests:  Reaction  Time  2  (Choice  Reaction  Time),  Target  Tracking  1,  Cannon 
Shoot,  Target  Tracking  2,  and  Target  Shoot.  These  tests  were  selected 
because they were thought to be the tests that would show greatest
improvement with practice (all the psychomotor tests were included). There
were three practice trials. For the first two practice trials, unique items
(i.e., items not appearing on the full battery test) were used for Target
Tracking 1, Target Tracking 2, and Cannon Shoot. For the third trial, the
original item content was used.

Due to the usual exigencies of data collection in the field, on some
days fewer than 24 soldiers appeared, and on other days more than 24 soldiers
appeared. Consequently, the actual sample sizes were as follows:

N = 293 completed all cognitive and non-cognitive paper-and-pencil tests

N = 256 completed computer tests

N = 112-129 completed retest of paper-and-pencil tests (N varied across
tests)

N = 113 completed retest of computer tests

N = 74 completed practice effects on computer tests

Table  11.33  shows  the  race  and  gender  makeup  for  Fort  Knox  soldiers 
completing  at  least  part  of  the  Pilot  Trial  Battery.  The  mean  age  of  the 
participating soldiers was 21.9 years (SD = 3.1). The mean years in service,
computed from date of entry into the Army, was 1.6 years (SD = 0.9).


Table 11.33

Race and Gender of the Fort Knox Field Test Sample for the
Pilot Trial Battery

Race                  Frequency

White                    156
Hispanic                  24
Black                    121
Native American            2
Total

Sex                   Frequency

Female                    57
Male                     245

Descriptive  Statistics 

Table 11.34 shows the means, standard deviations, and reliabilities of
the cognitive paper-and-pencil tests in the Pilot Trial Battery. The means
and standard deviations indicate that the tests are at about the desired
level of difficulty, with the possible exception of Orientation Test 3, which
appeared somewhat more difficult than desirable. Internal consistency
reliability estimates were relatively high, with the exception of Reasoning
Test 2 (.63).

The Fort Lewis split-half coefficients, which are based on separately
timed halves, provide an appropriate estimate of internal consistency for
speeded tests. The interval for test-retest was 2 weeks. Reasoning was
again the least reliable. Table 11.35 shows gain scores that were higher
than initially expected on Object Rotation, Shapes, Path, and Orientation 1
tests. Inspection of the last two columns in Table 11.35 indicated that much
of the gain probably occurred because the soldiers attempted more items the
second time they took the test. This is certainly to be expected since the
subjects would be more familiar with types and instructions.


Means, Standard Deviations, and Reliability Estimates for the Fort Knox
Field Test of the Ten Paper-and-Pencil Cognitive Tests
Pilot Test Battery for Persons Taking Tests at Both Time 1 and Time 2
Table 11.36 presents the descriptive statistics for 19 scores derived
from the computer-administered measures. In general, the cognitive/
perceptual tests (except for Cannon Shoot) yield two types of scores:
accuracy and speed. In addition, two derived measures can be computed: the
slope and intercept obtained when reaction times are regressed against a
relevant parameter of test items. For Perceptual Speed and Accuracy, such a
parameter was the number of stimuli being compared in an item (i.e., two,
five, or nine objects). Recall that the slope represents the average
increase in reaction time with an increase of one object in the stimulus set;
the lower the value, the faster the comparison. The intercept represents all
other processes not involved in comparing stimuli, such as encoding the
stimuli and executing the response.
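
The slope and intercept scores described above can be illustrated with a
short least-squares sketch. This is only an illustration under stated
assumptions: the function name fit_line and the reaction time values are
hypothetical, and the report does not say which fitting routine was actually
used. The set sizes (two, five, and nine objects) are taken from the text.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of ys on xs; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squares of x and sum of cross-products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Number of stimuli per item (from the text) and hypothetical mean reaction
# times, in hundredths of a second, for one subject at each set size.
set_sizes = [2, 5, 9]
mean_rts = [110.0, 230.0, 390.0]  # illustrative values only

slope, intercept = fit_line(set_sizes, mean_rts)
# slope: added reaction time per additional object in the stimulus set
# intercept: time for processes unrelated to comparison (encoding, responding)
```

With these made-up values the subject needs about 40 hundredths of a second
per additional object, with about 30 hundredths of a second left over for
encoding and responding.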

Reaction times on all tests were computed only for correct responses,
and the development strategy was to construct items so that every item could
be answered correctly, given enough time. Consequently, the speed measures
(reaction time) were expected to have more variance and be more meaningful
than the accuracy measures.

Analyses of Fort Knox data indicated that total reaction time and
decision time were very highly correlated and, since movement time is
conceptually uninteresting, we elected to use total reaction time for all
tests. There are a number of ways to score reaction time, and for the various
alternatives the score distributions, intercorrelations, and reliabilities
were examined. There were no striking differences between methods, and
untrimmed means were used for all tests except Simple and Choice Reaction
Times; because of fewer items, extreme scores could affect the mean much more
for these two tests than for the others. To deal with the problem of missing
data, cases with more than 10% missing were eliminated.

Procedures for scoring the Cannon Shoot Test differed from those used to
score the other cognitive/perceptual tests, and a reaction time score is
inappropriate because the task requires the subject to ascertain the optimal
time to fire to ensure a direct hit on the target. Therefore, responses
were scored by computing a deviation score composed of the difference between
the time the subject fired and the optimal time to fire. These scores are
summed across all items and the mean deviation time is computed.
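
A minimal sketch of the Cannon Shoot deviation score follows. It assumes the
deviation is taken as an absolute difference, which the text does not state
explicitly; the function name and the times used are hypothetical.

```python
def cannon_shoot_time_error(fire_times, optimal_times):
    """Mean deviation between actual and optimal firing times.

    Assumption: each item's deviation is the absolute difference, so early
    and late shots do not cancel each other out.
    """
    deviations = [abs(fired - optimal)
                  for fired, optimal in zip(fire_times, optimal_times)]
    return sum(deviations) / len(deviations)


# Hypothetical firing times and optimal times, in hundredths of a second.
time_error = cannon_shoot_time_error([100, 250, 300], [120, 200, 310])
```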

For two of the three psychomotor tests, Target Tracking 1 and 2, the
distance from the center of the crosshairs to the center of the target was
computed approximately 16 times per second, or almost 350 times per trial.
These distances were then averaged by the computer to generate the mean
distance for each trial. However, the frequency distribution of these scores
proved to be highly positively skewed and they were transformed using the
natural logarithm transformation. The overall test score for each subject
was then the mean of the log mean distance across trials.
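
The tracking score described above (average the sampled distances within a
trial, take the natural logarithm, then average across trials) can be sketched
as follows. The function name and the distance values are hypothetical; only
the order of operations is taken from the text.

```python
import math


def tracking_score(trials):
    """Mean, across trials, of the natural log of each trial's mean distance.

    trials: list of trials, each a list of sampled crosshair-to-target
    distances (all positive).
    """
    log_means = [math.log(sum(distances) / len(distances))
                 for distances in trials]
    return sum(log_means) / len(log_means)


# Two hypothetical trials with a handful of distance samples each
# (a real trial would contain roughly 350 samples).
score = tracking_score([[12.0, 18.0, 30.0], [25.0, 22.0, 28.0]])
```

Taking the logarithm of each trial mean before averaging reduces the influence
of a few very poor trials, which matches the stated reason for the
transformation (positively skewed trial scores).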

Scoring of the Target Shoot Test was a bit more complicated. Three
overall test scores were generated for each subject: (a) the percentage of
hits; (b) the mean distance from the center of the crosshairs to the center
of the target at the time of firing (the distance score); and (c) the mean
time elapsed from the start of the trial until firing (the time-to-fire
score).


Table 11.36

Characteristics of the 19 Dependent Measures for Computer-
Administered Tests: Fort Knox Field Tests (N = 256)a

                                                              Reliability
                                                        Split-Half  Test-Retest
Dependent Measure                Mean          SD         (rsh)b      (rtt)b

PERCEPTUAL

Simple Reaction Time (SRT)
  Mean Reaction Time (RT)        56.2 hsc      18.8 hs      .90         .37

Choice Reaction Time (CRT)
  Mean Reaction Time (RT)        67.4 hs       10.2 hs      .89         .56

Perceptual Speed and Accuracy (PS&A)
  Percent Correct (PC)           [illegible]   8%           .83         .59
  Mean Reaction Time (RT)        325.6 hs      70.4 hs      .96         .65
  Slope                          42.7 hs/chd   15.6 hs/ch   .88         .67
  Intercept                      68.0 hs       45.0 hs      .74         .55

Target Identification
  Percent Correct (PC)           90%           10%          .84         .19
  Mean Reaction Time (RT)        528.7 hs      134.0 hs     .96         .67

Short-Term Memory (STM)
  Percent Correct (PC)           85%           8%           .72         .34
  Mean Reaction Time (RT)        129.7 hs      23.8 hs      .94         .78
  Slope                          7.2 hs/ch     4.5 hs/ch    .52         .47
  Intercept                      108.1 hs      23.2 hs      .84         .74

Number Memory
  Percent Correct (PC)           83%           13%          .63         .53
  Mean Operation Time (RT)       230.7 hs      73.9 hs      .95         .88

Cannon Shoot
  Time Error (TE)                78.6 hs       20.3 hs      .88         .66

PSYCHOMOTOR

Target Track 1
  Mean Log Distance              3.2           .44          .97         .68

Target Shoot
  Mean Time to Fire (std) (TF)   -.01          .48          .91         .48
  Mean Log Distance (std)        -.01          .41          .86         .58

Target Track 2
  Mean Log Distance              3.91          .49          .97         .68

a N varies slightly from test to test.
b N = 120 for test-retest reliabilities, but varies slightly from test to
test. rsh = split-half reliability; odd-even item correlation with
Spearman-Brown correction. rtt = test-retest reliability, 2-week interval
between administrations.
c hs = hundredths of a second.
d hs/ch = hundredths of a second per character.
Percentage of hits was less desirable as a measure because it contains
relatively little information compared to the distance measure.
Complications arose because subjects received no distance or time-to-fire
scores on trials where they failed to fire at the target before the time
limit for the trial elapsed. Consequently, the distance and time-to-fire
scores for each trial were standardized and the overall distance and time
score was then computed by averaging these standardized scores across all
trials in which the subject fired at the target.
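
The per-trial standardization for Target Shoot can be sketched as below. This
is an illustration only: the function name, the data layout (None marks a
trial on which the subject did not fire), and the use of the sample standard
deviation are assumptions not stated in the report.

```python
from statistics import mean, stdev


def standardized_trial_average(scores_by_subject):
    """Average of per-trial z-scores over the trials on which a subject fired.

    scores_by_subject: dict mapping subject id to a list of per-trial scores,
    with None for trials on which the subject did not fire.
    """
    n_trials = len(next(iter(scores_by_subject.values())))
    z_scores = {subject: [] for subject in scores_by_subject}
    for t in range(n_trials):
        # Only subjects who fired on trial t contribute to its mean and SD.
        fired = {s: v[t] for s, v in scores_by_subject.items()
                 if v[t] is not None}
        values = list(fired.values())
        if len(values) < 2:
            continue  # cannot standardize a trial with fewer than two scores
        trial_mean = mean(values)
        trial_sd = stdev(values)  # assumption: sample SD
        if trial_sd == 0:
            continue  # no variation on this trial, nothing to standardize
        for subject, x in fired.items():
            z_scores[subject].append((x - trial_mean) / trial_sd)
    return {s: (mean(z) if z else None) for s, z in z_scores.items()}


# Hypothetical distance scores for three subjects on two trials;
# subject "A" did not fire on the second trial.
scores = standardized_trial_average({
    "A": [1.0, None],
    "B": [2.0, 4.0],
    "C": [3.0, 6.0],
})
```

Because each trial is standardized separately, a subject who skips a difficult
trial is not rewarded or penalized relative to subjects who attempted it; the
overall score reflects only relative standing on the trials actually fired.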

In Table 11.36 the split-half reliabilities are odd-even correlations
corrected to full test length, but note that they do not suffer from the
artificial inflation that speeded paper-and-pencil measures do. This is
because all items are completed by every subject.

The test-retest reliabilities are lower than the split-half
reliabilities and three of them are very low. However, two of them, the
percent-correct scores, are not the primary score for their respective tests,
and Simple Reaction Time is viewed largely as a "warm-up" test.
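
The odd-even split-half estimate with Spearman-Brown correction, as described
for Table 11.36, can be sketched as follows. The function names and the item
data are hypothetical; the correction formula itself, r_full = 2r / (1 + r),
is the standard Spearman-Brown formula for doubling test length.

```python
import math


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r_half):
    """Correct a half-test correlation to full test length."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(item_scores):
    """Odd-even split-half reliability with Spearman-Brown correction.

    item_scores: one list of item scores per subject, items in test order.
    """
    odd_totals = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_totals = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_totals, even_totals))


# Hypothetical reaction times (hundredths of a second) on four items
# for five subjects.
reliability = split_half_reliability([
    [50, 52, 49, 55],
    [60, 58, 63, 61],
    [45, 47, 44, 46],
    [70, 66, 72, 69],
    [55, 57, 54, 58],
])
```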


Special  Analyses  on  Computer-Administered  Tests 

Correlations With Video Game-Playing Experience. Field test subjects
were asked the question, "In the last couple years, how much have you played
video games?" There were five possible alternatives, ranging from "You have
never played video games" to "You have played video games almost every day."
The five alternatives were given scores of 1 to 5, respectively. The mean
was 2.99, SD was 1.03 (N = 256), and the test-retest reliability was .71 (N =
113).

The 19 correlations of this item with the computer test scores ranged
from -.01 to +.27, with a mean of .10. A correlation of .12 is significant
at alpha = .05. We interpret these findings as showing a small, but
significant, relationship of video game-playing experience to the more
"game-like" tests in the battery.

Effects of "Machine" or Computer Testing Station Differences. We
repeated the investigation done at the pilot test at Fort Lewis of the effect
of machine or computer testing station differences on computer test scores.
There were six computer testing stations, and approximately 40 male soldiers
had been tested at each station. (We used only males in this analysis to
avoid confounding the results with possible sex differences.) We ran a
multivariate analysis of variance (MANOVA) for the 19 computer test scores,
with six "machine" levels. Machine differences again had no effect on test
scores.

Practice Effects on Selected Computer Test Scores. Table 11.37 shows
the results of the analyses of variance for the five tests included in the
practice effects research. These results show only one statistically
significant practice effect, the Mean Log Distance score on Target Tracking
Test 2. There were three statistically significant findings for time,
indicating that scores did change with a second testing, whether or not
practice trials intervened between the two tests. Finally, note that the
omega squared value indicates that relatively small amounts of test score
variance are accounted for by the Group, Time, or Time by Group factors.
Table 11.37

Effects of Practice on Selected Computer Test Scores

                     Dependent           Source of                      Omega
Test                 Measure             Variance       df       F      Squared

Choice Reaction      Trimmed Mean        Group          1,180    9.71*   .032
Time                 Reaction Time       Time           1,180   25.70*   .035
                                         Time x Group   1,180     .73    --

Target Tracking 1    Mean Log Distance   Group          1,178     .73    --
                                         Time           1,178    9.26*   .005
                                         Time x Group   1,178    4.11    --

Target Tracking 2    Mean Log Distance   Group          1,178     .47    --
                                         Time           1,178    1.30    --
                                         Time x Group   1,178    7.79*   .005

Cannon Shoot         Time Error          Group          1,171    3.79    --
                                         Time           1,171     .16    --
                                         Time x Group   1,171    5.72    --

Target Shoot         Mean Log Distance   Group          1,171     .41    --
                                         Time           1,171    9.28*   .012
                                         Time x Group   1,171     .08    --

*Denotes significance at p < .01.


These data suggest that the practice intervention was not a particularly
strong one. It should be noted, though, that on some tests subjects'
performance actually deteriorated from Time 1 to Time 2. The average gain
score for the two groups across the five dependent measures was only .09
standard deviation. This suggests either that the tasks used in these tests
are resistant to practice effects, or that performance on these tasks reaches
a maximum level of proficiency after only a few trials. Also, recall that
analyses of the Pilot Trial Battery cognitive paper-and-pencil tests (see
Table 11.35) showed gain scores that were as high or higher than those found
here. Perhaps gain in scores through retesting or practice is of even less
concern for computerized tests than for paper-and-pencil tests.

In summary, data from the practice experiment indicate that scores from
computerized psychomotor tests appear to be quite stable over a 2-week
period. Practice does have some effect on test scores, but it appears to be
relatively small. Certainly it does not seem strong enough to warrant
serious concern about the usefulness of the tests.



Intercorrelations of Cognitive Paper-and-Pencil Tests, Computer-Administered
Tests, and ASVAB Subtests

Table 11.38 contains the intercorrelations for the ASVAB subtests,
paper-and-pencil cognitive measures, and computer-administered tests, which
include both cognitive/perceptual and psychomotor measures. Note that we
have also included scores on the AFQT. These correlations are based on the
Fort Knox field test sample, but include only those subjects with test scores
available on all variables (N = 166).

In examining these relationships, we first looked at the correlations
between tests within the same battery. For example, correlations between
ASVAB subtest scores range from .02 to .74 (absolute values). The range of
intercorrelations is a bit more restricted when examining the relationships
between the cognitive paper-and-pencil test scores (.27 to .67). This range
of values reflects the fact that the PTB measures were designed to tap fairly
similar cognitive constructs. Intercorrelations for the cognitive/perceptual
computer test scores range from .00 to .63 in absolute terms. Note that the
highest values appear for correlations between scores computed from the same
test. Intercorrelations between psychomotor variables range from .15 to .61
in absolute terms.

Perhaps the most important question to consider is the overlap between
the different groups of measures. Do the paper-and-pencil measures and
computer tests correlate highly with the ASVAB or are they measuring unique
or different abilities? Note that across all PTB paper-and-pencil tests,
ASVAB Mechanical Comprehension appears to correlate the highest with the new
tests. Across all ASVAB subtests, Orientation Test 3 yields the highest
correlations.

Table 11.39 summarizes the correlational data in Table 11.38. The two
tables lead to the conclusion that the various types of measures do not
overlap excessively, and therefore do appear to each make separate
contributions to ability measurement.


Factor  Analysis  Results 

In addition to examining the intercorrelations among all the
cognitive/perceptual measures and psychomotor measures, we also examined
results from a factor analysis. Two variables, PS&A reaction time and
Short-Term Memory reaction time, were omitted to avoid obtaining
communalities greater than one, because the reaction time scores from these
measures correlated very highly with their corresponding slope or intercept
variables. Results from the seven-factor solution of a principal components
factor analysis with varimax rotation are displayed in Table 11.40. All
loadings of .30 or greater are shown.

Factor 1 includes eight of the ASVAB subtests, six of the paper-and-
pencil measures, and two cognitive/perceptual computer variables.
Because this factor contains measures of verbal, numerical, and
reasoning ability, we have termed this "g."
Principal Components Factor Analysis of Scores of the ASVAB Subtests,
Cognitive Paper-and-Pencil Measures, and Cognitive/Perceptual and
Psychomotor Computer-Administered Testsa (N = 168)


Variable  Factor  1  Factor  2  Factor  3  Factor  4  Factor  5  Factor  S  Factor  7  Ji? 


ASVAB 


GS 

75 

AR 

75 

UK 

77 

PC 

62 

NO 

cs 

AS 

62 

MK 

77 

MC 

63 

El 

72 

COSNITIVC  PAPER- 
AfID-PENC  I L 


59 

73 

62 

47 

84  77 

62  44 

58 
70 
68 
65 


Assemb  ObJ 

35 

69 

Obj  Rotation 

-61 

Shapes 

66 

Mate 

70 

Path 

67 

Reason  1 

37 

58 

Reason  2 

37 

47 

Orient  1 

37 

64 

Orient  2 

40 

46 

Orient  3 

60 

52 

PERCEPTUAL 

COMPUTER 


66 

49 

51 
67 
65 
54 
44 
58 

52 
67 


SRT-RT 

63 

44 

CRT-RT 

61 

50 

PS&A-PC 

67 

31 

70 

PS&A  Slope 

88 

81 

PS&A  Inter 

•  65 

50 

74 

Target  ID-PC 

40 

25 

Target  ID-RT 

-41  37 

30 

57 

STM-PC 

3° 

34 

41 

STM-S1  ope 

41 

25 

STM-Int 

38 

51 

47 

Cannon  Shoot-TE 

32 

19 

Ho  Mem- PC 
No  Mem-RT 


53 

■37 


37 

-46 


52 

54 


PSYCHOMOTOR 

COMPUTER 

Tracking  1 
Tracking  2 
Target  Shoot-TF 
Target  Shoct-Olst 

Variance 

Explained  5.69  4.70 


06 

77 

64 

2.83 


2.37 


1.92 


42 

1.87 


82 

66 

23 

48 


1.17 


Note: Decimals have been omitted from factor loadings.

a Note that the following variables were not included in this factor
analysis: AFQT, PS&A Reaction Time, and Short-Term Memory Reaction Time.

h2 = communality (sum of squared factor loadings) for variables.


Factor 2 includes all of the PTB cognitive paper-and-pencil measures,
Mechanical Comprehension from the ASVAB, and Target Identification
reaction time from the computer tests. We called this a general spatial
factor.

Factor  3  has  major  loadings  on  the  three  psychomotor  tests,  with 
substantially  smaller  loadings  from  three  cognitive/perceptual  computer 
test  variables,  the  Path  Test,  and  Mechanical  Comprehension  from  the 
ASVAB.  Given  the  high  loadings  of  the  psychomotor  tests  on  this  factor, 
we  refer  to  this  as  the  motor  factor. 

Factor 4 includes variables from the cognitive/perceptual computer
tests. This factor appears to involve accuracy of perception across
several tasks and types of stimuli.

Factor 5 is not that clear, but the highest loadings are on
straightforward reaction time measures, so we interpret this as a speed of
reaction factor.

Factor  6  contains  four  variables,  two  from  the  ASVAB  and  two  from  the 
cognitive/perceptual  computer  tests.  This  factor  appears  to  represent 
both  speed  of  reaction  and  arithmetic  ability. 

Factor 7 contains three variables from the computer tests. These
include Short-Term Memory percent correct and slope, and Target Shoot
time-to-fire. This factor is difficult to interpret, but we believe it
may represent a response style factor. That is, this factor suggests
that those individuals who take a longer time to fire on the Target
Shoot Test also tend to have higher slopes on the Short-Term Memory test
(lower processing speeds with increased bits of information) but are
more accurate or obtain higher percent-correct values on the Short-Term
Memory test.

Note that several variables have fairly low communalities. These may be
due to relatively low score variance or reliability, but it could also be due
to these variables having unique variance, at least when factor analyzed with
this set of tests. We think this latter explanation is highly plausible for
the Cannon Shoot score.


Field Test of Non-Cognitive Measures (ABLE and AVOICE)

This section describes the field test results for the
non-cognitive predictor measures, including the descriptive scale statistics
and the results of the fakability analyses. The samples were different for
the different analyses, and each is described in turn.


Scale  Analyses 

These  analyses  were  performed  to  obtain  descriptive  scale  statistics  and 
examine  the  covariation  among  scales.  Only  the  Fort  Knox  data  were  used. 


Sample. At Fort Knox a total of 290 soldiers completed the ABLE and 287
soldiers the AVOICE. After deletion of inventories with greater than 10%
missing data for both measures, and of those ABLEs where scores on the
Non-Random Response Scale were less than six, a total of 276 ABLEs and 270
AVOICEs were available for analysis. For the experiment in which portions of
the Pilot Trial Battery were re-administered to soldiers 2 weeks after the
first administration, the total number of "Time 2" ABLE and AVOICE
inventories, after the data quality screens had been applied, was 109 and
127, respectively.

Results. Summary statistics for the ABLE and AVOICE are presented in
Table 11.41, Table 11.42 (ABLE), and Table 11.43 (AVOICE). As can be seen,
the median alpha coefficient (internal consistency) for the ABLE content
scales is .84, and the median test-retest correlation for the ABLE content
scales is .79, with a range of .68 to .83. At retest or second testing, the
soldiers apparently responded in a somewhat more random way. The response
validity scale, Non-Random Responses, detected it and, consequently, the
correlation between first and second testing was low, .37. The median alpha
coefficient (internal consistency) for the AVOICE scales is .86. The median
test-retest correlation for the AVOICE scales is .70.

The ABLE content scales and the AVOICE scales were separately factor
analyzed, and in both cases the two-factor solution appeared to best
summarize the data. As shown in Tables 11.44 (for ABLE) and 11.45 (for
AVOICE), the temperament factors were labeled Personal Impact and
Dependability, and the interest factors were labeled Combat-Related and
Combat Support.


Fakability Analyses


Recall that there were four validity scales on the ABLE: Non-Random
Responses, Unlikely Virtues (Social Desirability), Poor Impression, and
Self-Knowledge. To investigate intentional distortion of responses, data
were gathered (a) from soldiers instructed, at different times, to distort
their responses or to be honest (experimental data gathered at Fort Bragg);
(b) from soldiers who were simply responding to the ABLE and AVOICE with no
particular directions (data gathered at Fort Knox); and (c) from recently
sworn-in Army recruits at the Minneapolis Military Entrance Processing
Station (MEPS).


The  purposes  of  the  faking  study  were  to: 

•  Determine the extent to which soldiers can distort their responses
to temperament and interest inventories when instructed to do so.
(Compare data from Fort Bragg faking conditions with Fort Bragg and
Fort Knox honest conditions.)


•  Determine  the  extent  to  which  the  ABLE  response  validity  scales 
detect  such  intentional  distortion.  (Compare  response  validity 
scales  in  Fort  Bragg  honest  and  faking  conditions.) 

•  Determine  the  extent  to  which  ABLE  validity  scales  can  be  used  to 
correct  or  adjust  scores  for  intentional  distortion. 
Table 11.42

ABLE Test-Retest Resultsa: Fort Knox Field Test

                               Mean         Mean
                               Time 1       Time 2
Scale                          (N = 276)    (N = 109)    Effect Sizeb

Content Scales

Emotional Stability            64.9         65.1           .02
Self-Esteem                    35.1         34.8          -.05
Cooperativeness                54.1         54.3           .04
Conscientiousness              48.9         48.3          -.10
Nondelinquency                 55.4         55.6           .02
Traditional Values             37.2         37.9           .15
Work Orientation               61.2         60.7          -.07
Internal Control               50.3         50.2          -.01
Energy Level                   57.1         57.0          -.01
Dominance                      35.5         34.9          -.09
Physical Condition             31.1         30.4          -.09

Response Validity Scales

Unlikely Virtues               16.6         17.5           .27
Self-Knowledge                 29.6         29.0          -.18
Non-Random Responsec            7.7          7.2          -.65
Poor Impression                 1            1.2          -.18

a Test-retest interval was two weeks.

b Effect Size = (Mean Time 1 - Mean Time 2)/SD Time 1.

c Based on sample edited for missing data only; N1 = 281 and N2 = 121.
AVOICE Scale Score Characteristics: Fort Knox Field Test (N = 270 except where
otherwise noted)
Table 11.44

ABLE Factor Analysisa: Fort Knox Field Test

                               I                  II
Scale                          Personal Impact    Dependability

Self-Esteem
Energy Level
Dominance (Leadership)
Emotional Stability
Work Orientation
Nondelinquency
Traditional Values
Conscientiousness
Cooperativeness
Internal Control

Note: h2 = communality, the sum of squared factor loadings for a variable.
a Principal factor analysis, varimax rotation.
AVOICE Factor Analysisa: Fort Knox Field Test

                                 I            II
                                 Combat       Combat-
Scale                            Supportb     Relatedc     h2

Office Administration            .85          -.13         .73
Supply Administration            .78           .11         .62
Teaching/Counseling              .76           .11         .59
Mathematics                      .74           .09         .55
Medical Services                 .73           .18         .57
Automated Data Processing        .71           .10         .51
Audiographics                    .64           .17         .44
Electronic Communication         .64           .36         .54
Science/Chemical Operations      .61           .43         .55
Aesthetics                       .61           .04         .37
Leadership                       .58           .35         .46
Food Service                     .54           .19         .33
Drafting                         .54           .34         .41
Infantry                         [illegible]   .85         .74
Armor/Cannon                     .13           .84         .73
Heavy Construction/Combat        .17           .84         .73
Outdoors                         .02           .74         .55
Mechanics                        .17           .74         .58
Marksman                         .05           .73         .54
Vehicle/Equipment Operator       .17           .73         .56
Agriculture                      .18           .64         .44
Law Enforcement                  .27           .61         .44
Electronics                      .45           .57         .52

Note: h2 = communality, the sum of squared factor loadings for a variable.
a Principal factor analysis, varimax rotation.
b Conventional, Social, Investigative, Enterprising, Artistic constructs.
c Realistic construct.
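
As a brief illustration of the communality definition in the table note (the
sum of squared factor loadings for a variable), the sketch below recomputes h2
for two AVOICE scales from their two reported loadings. The function name is
hypothetical; the loadings are those printed in the table, so small rounding
differences from the printed h2 values are to be expected for other scales.

```python
def communality(loadings):
    """Sum of squared factor loadings for one variable."""
    return sum(loading ** 2 for loading in loadings)


# Loadings on Factor I and Factor II as printed in the AVOICE table.
supply_administration = communality([0.78, 0.11])  # printed h2 is .62
teaching_counseling = communality([0.76, 0.11])    # printed h2 is .59
```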
•  Determine the extent to which distortion is a problem in an
applicant setting. (Compare MEPS data with Fort Bragg and Fort
Knox data.)


Subjects. The participants in the experimental group were 425 enlisted
soldiers in the 82nd Airborne brigade at Fort Bragg. Comparison samples were
MEPS candidates (N = 126) and the Fort Knox soldiers described earlier (N =
276).


Procedure and Design. Four faking and two honest conditions were
created:

Fake Good on the ABLE
Fake Bad on the ABLE
Fake Combat on the AVOICE
Fake Non-combat on the AVOICE
Honest on the ABLE
Honest on the AVOICE


The  significant  parts  of  the  instructions  for  the  six  conditions  were  as 
follows: 


ABLE - Fake Good

Imagine you are at the Military Entrance Processing Station (MEPS)
and you want to join the Army. Describe yourself in a way that you
think will ensure that the Army selects you.


ABLE - Fake Bad

Imagine you are at the Military Entrance Processing Station (MEPS)
and you do not want to join the Army. Describe yourself in a way
that you think will ensure that the Army does not select you.


ABLE - Honest

You are to describe yourself as you really are.


AVOICE - Fake Combat

Imagine you are at the Military Entrance Processing Station (MEPS).
Please describe yourself in a way that you think will ensure that you
are placed in an occupation in which you are likely to be exposed to
combat during a wartime situation.


AVOICE - Fake Non-combat

Imagine you are at the Military Entrance Processing Station (MEPS).
Please describe yourself in a way you think will ensure that you are
placed in an occupation in which you are unlikely to be exposed to
combat during a wartime situation.


AVOICE - Honest

You are to describe yourself as you really are.


The design was a 2x2x2 with repeated measures on faking and honest
conditions, which were counterbalanced. Thus, approximately half the
experimental group, 124 soldiers, completed the inventories honestly in the
morning and faked in the afternoon, while the other half (121) completed the
inventories honestly in the afternoon and faked in the morning. The first
between-subjects factor consisted of these two levels: Fake Good/Want Combat
and Fake Bad/Do Not Want Combat. Order was manipulated in the second
between-subjects factor such that the following two levels were produced:
faked responses then honest responses, and honest responses then faked
responses.
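
The layout of the design can be enumerated as a short Python sketch. The level labels are taken from the paragraph above; the variable and function names are illustrative assumptions rather than anything used in the study.

```python
from itertools import product

# Between-subjects factors (labels taken from the text above).
SETS = ("Fake Good/Want Combat", "Fake Bad/Do Not Want Combat")
ORDERS = ("honest then faked", "faked then honest")
# Within-subjects (repeated-measures) factor.
CONDITIONS = ("honest", "faked")

# Every combination of set, order, and condition: 2 x 2 x 2 cells.
cells = list(product(SETS, ORDERS, CONDITIONS))

def session_plan(order):
    """Return the (morning, afternoon) conditions for one order level."""
    first, _, second = order.partition(" then ")
    return first, second

for fake_set, order in product(SETS, ORDERS):
    print(fake_set, "|", session_plan(order))
```

Each soldier belongs to one set and one order (the between-subjects factors) and contributes scores under both the honest and the faked condition (the repeated measure).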

Faking Study Results - Temperament Inventory. A multivariate analysis
of variance (MANOVA) on the Fort Bragg data on the temperament inventory
shows that all the Fake by Set interactions are significant, indicating that
soldiers can, when instructed to do so, distort their responses. The order
of conditions in which the participant completed the ABLE also affected the
results. Table 11.46 shows the mean scores for the various experimental
conditions.

Another research question was the extent to which the response validity
scales detected intentional distortion. As can be seen in Table 11.47, the
response validity scale Social Desirability (Unlikely Virtues) detects Faking
Good on the ABLE; the response validity scales Non-Random Response, Poor
Impression, and Self-Knowledge detect Faking Bad. According to these data,
the soldiers responded more randomly, created a poorer impression, and
reported that they knew themselves less well when told to describe themselves
in a way that would increase the likelihood that they would not be accepted
into the Army.

We also examined the extent to which the response validity scales Social
Desirability and Poor Impression could be used to adjust ABLE content scale
scores for Faking Good and Faking Bad. Social Desirability was partialled
from the content scales in the Fake Good condition and Poor Impression from
the content scales in the Fake Bad condition.
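
The report does not give the adjustment formula, so the following Python sketch shows only one common way to partial a validity scale out of a content scale: regress the content scores on the validity scores and keep the residuals. The function name, the re-centering on the content-scale mean, and the scores are all illustrative assumptions, not data or procedures from the study.

```python
from statistics import mean

def partial_out(content, validity):
    """Residualize content-scale scores on a response validity scale.

    Fits content = a + b * validity by ordinary least squares and returns
    the residuals re-centered on the content-scale mean, so adjusted
    scores stay on the original metric.
    """
    mx, my = mean(validity), mean(content)
    sxx = sum((x - mx) ** 2 for x in validity)
    sxy = sum((x - mx) * (y - my) for x, y in zip(validity, content))
    slope = sxy / sxx
    return [y - slope * (x - mx) for x, y in zip(validity, content)]

# Hypothetical scores for five respondents (not data from the report).
social_desirability = [10, 12, 15, 18, 20]
content_scale = [30, 33, 35, 41, 44]
adjusted = partial_out(content_scale, social_desirability)
```

After adjustment, the content scores are uncorrelated with the validity scale within the sample used to fit the slope. Whether such a slope holds up in a new sample is a separate question.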

Table 11.48 shows the adjusted mean differences, which clearly show that
these response validity scales can be used to adjust content scales.
However, two important unknowns remain: Do the adjustment formulas developed
on these data cross-validate, and do they increase criterion-related
validity?

As noted earlier, in an effort to explore the extent to which
intentional distortion may be a problem in an applicant setting, the ABLE and
AVOICE were administered at the Minneapolis MEPS. However, the sample of 126
recruits who completed the inventories were not true "applicants," in that
they had just recently been sworn into the Army. To approximate the
applicant response set as closely as possible, recruits were allowed to
believe that their scores on these inventories might affect their Army
careers. Recruits were then asked to complete the ABLE and AVOICE, after
which they were debriefed.

To examine the extent to which recruits actually believed their ABLE and
AVOICE scores would have an effect on their Army career, a single item was
filled out by each recruit prior to debriefing. Of the 126 recruits in this
sample, 57 responded "yes" to the question of whether they believed scores
would have an impact, 61 said "no," and 8 wrote in that they didn't know.
Thus, while the MEPS sample was not a true "applicant" sample, its make-up
was reasonably close to it (recently sworn-in recruits, close to half of whom
believed their ABLE and AVOICE scores would affect their Army career).

[Table: Honesty and Faking Effects, ABLE Content Scales: Fort Bragg]

Table 11.49 shows mean scores for MEPS recruits and the two "honest"
conditions of this study, Fort Bragg and Fort Knox. In total, these results
suggest that intentional distortion may not be a significant problem in an
applicant setting. (Faking or distortion in a draft situation cannot even be
estimated in the present U.S. situation.)

Overall,  the  ABLE  data  show  that: 

• Soldiers can distort their responses when instructed to do so.

• The response validity scales detect intentional faking.

• An individual's Social Desirability scale score can be used to adjust
his or her content scale scores to reduce variance associated with
faking.

• Faking or distortion may not be a significant problem in an applicant
setting.

Faking Study Results - Interest Inventory. We divided the interest
scales into the two groups, Combat-Related and Combat Support, that emerged
from the factor analyses, and performed a multivariate analysis of variance
(MANOVA) on the Fort Bragg data. Nine of the 11 Combat-Related AVOICE scales
are sensitive to intentional distortion, and 9 of the 12 Combat Support
scales are sensitive to intentional distortion. The interactions of Fake by
Set by Order are significant at p = .05, indicating that the order of
conditions in which the participants completed the AVOICE also affected the
result. Tables 11.50 and 11.51 show mean scores for the various conditions
when the particular condition was the first administration.

When told to distort their responses so that they would be unlikely to be
placed in combat-related military occupational specialties (MOS), that is,
when instructed to Fake Non-combat, soldiers tended to decrease their scores
on all scales. Scores on 19 of 24 interest scales were lower in the Fake
Non-combat condition than in the honest condition. In the Fake Combat
condition, soldiers in general increased their Combat-Related scale scores
and decreased their Combat Support scale scores.

The ABLE response validity scales were then used to adjust AVOICE scale
scores for Faking Combat and Faking Non-combat. Comparing the adjusted
differences to the unadjusted differences revealed that these adjustments
have little effect, perhaps because the response validity scales consisted of
items from the ABLE and the faking instructions for the ABLE and AVOICE were
different (i.e., Fake Good and Fake Bad vs. Fake Combat and Fake Non-combat).

Again  the  question  was  explored  of  whether  applicants  might  tend  to 
distort  their  responses  to  the  AVOICE.  The  mean  scores  for  the  MEPS  recruits 
and  the  two  Honest  conditions,  Fort  Bragg  and  Fort  Knox,  showed  no  particular 
pattern  to  the  mean  score  differences. 


[Table: Effects of Faking, AVOICE Combat Support Scales: Fort Bragg]


Overall,  the  AVOICE  data  show  that: 

• Soldiers can distort their responses when instructed to do so.

• The ABLE Social Desirability and Poor Impression scales are not as
effective for adjusting AVOICE scale scores as they are for adjusting
ABLE content scale scores.

• Faking or distortion may not be a significant problem in an applicant
setting.


Summary  of  Field  Test  Results 

The data on field tests presented in this section were crucial for the
final revisions of the Pilot Trial Battery (PTB) before it was used in the
Concurrent Validation. The revisions based on the field tests are described
in Section 7.
Section 7


TRANSFORMING  THE  PILOT  TRIAL  BATTERY  INTO  THE  TRIAL  BATTERY 


The entire Pilot Trial Battery, as administered at the field tests,
required approximately 6.5 hours of administration time. However, the Trial
Battery, which was the label reserved for the predictor battery to be used in
the full-scale Concurrent Validation, had to fit in a 4-hour time slot.

Three general principles, consonant with the theoretical and practical
orientation that had been used since the inception of the project, guided the
revision and reduction decisions:

1. Maximize the heterogeneity of the battery by retaining measures of
as many different constructs as possible.

2. Maximize the chances of incremental validity and classification
efficiency.

3. Retain measures with adequate reliability.

Using all accumulated information, the final decisions were made in a
series of meetings attended by the project staff and by the Scientific
Advisory Group. Considerable discussion was generated at these meetings, but
the group was able to reach a consensus on the reductions and revisions to be
made to the Pilot Trial Battery.

The recommendations for revisions and the reasons for their adoption are
described in the following pages.


Changes to Cognitive Paper-and-Pencil Tests

Changes to the cognitive paper-and-pencil tests are summarized in Table
11.52.


The Spatial Visualization construct in the PTB was measured by three
tests: Assembling Objects, Object Rotation, and Shapes. The Shapes Test was
dropped because the evidence of validity for job performance for tests of
this type was judged to be less impressive than for the other two tests. The
Object Rotation Test was not changed. Eight items were dropped from the
Assembling Objects Test by eliminating those items that were very difficult
or very easy, or that had low item-total correlations. The time limit for
Assembling Objects was not changed; the effect was to make Assembling Objects
more a power test than it was prior to the changes.

The Spatial Scanning construct was measured by two tests, Mazes and
Path. The Path Test was dropped and the Mazes Test was retained with no
changes. Mazes showed higher test-retest reliabilities than Path (.71 vs.
.64), and gain scores were lower (.24 SD units for Mazes vs. .62 SD units for
Path), which was desirable. Also, Mazes was a shorter test than Path (5.5
minutes versus 8 minutes).
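
A gain score expressed in SD units can be computed as in the following Python sketch. The report does not say which standard deviation was used as the divisor; this sketch assumes the standard deviation of the first administration, and the scores are hypothetical.

```python
from statistics import mean, stdev

def standardized_gain(test_scores, retest_scores):
    """Mean retest-minus-test gain, in first-administration SD units."""
    gain = mean(retest_scores) - mean(test_scores)
    return gain / stdev(test_scores)

# Hypothetical scores for the same six examinees (not report data).
first = [10, 12, 14, 16, 18, 20]
second = [11, 13, 15, 17, 19, 21]
print(round(standardized_gain(first, second), 2))
```

A smaller value means that scores shift less with practice, which is why the lower gain for Mazes counted in its favor.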


Table 11.52

Summary of Changes to Paper-and-Pencil Cognitive Measures
in the Pilot Trial Battery

Test Name             Changes

Assembling Objects    Decrease from 40 to 32 items.
Object Rotation       Retain as is with 90 items.
Shapes                Drop test.
Mazes                 Retain as is with 24 items.
Path                  Drop test.
Reasoning 1           Retain as is with 30 items.
                      New name REASONING TEST.
Reasoning 2           Drop test.
Orientation 1         Drop test.
Orientation 2         Retain as is with 24 items.
                      New name ORIENTATION TEST.
Orientation 3         Retain as is with 20 items.
                      New name MAP TEST.

The Figural Reasoning construct was measured by Reasoning Test 1 and
Reasoning Test 2. Reasoning 1 was evaluated as the better of the two tests
because it had higher reliabilities for both internal consistency (alpha =
.83 vs. .65; separately timed, split-half coefficients = .78 vs. .63) and
test-retest (.64 vs. .57), as well as a higher uniqueness estimate (.49 vs.
.37). Reasoning 1 was retained with no item or time limit changes and
Reasoning 2 was dropped. Reasoning Test 1 was renamed Reasoning Test.

Three tests in the PTB measured the Spatial Orientation construct.
Orientation Test 1 was dropped because it showed lower test-retest
reliabilities (.67 vs. .80 and .84) and higher gain scores (.63 SD units vs.
.11 and .08 SD units). In addition, we modified the instructions for
Orientation Test 2 because field test experience had indicated that the PTB
instructions were not as clear as they should be. Orientation Test 2 was
renamed Orientation Test. Orientation Test 3 was retained with no changes and
renamed the Map Test.

Changes  to  Computer-Administered  Tests 

Besides the changes made to specific tests, several improvements were
made to the computer battery as a whole.


General  Improvements 

The  more  general  changes  were  as  follows. 

1. Virtually all test instructions were modified:

   • Most instructions were shortened considerably.

   • Names of buttons, slides, and switches on the response pedestals
     were written in capital letters whenever they appeared.

   • Where possible, the following standard outline was followed in
     preparing the instructions:

     - Test name

     - One-sentence description of the purpose of the test

     - Step-by-step test instructions

     - One practice item

     - Brief restatement of test instructions

     - Two or three additional practice items

     - Instructions to call test administrator if there are
       questions about the test.

2. Whenever the practice items had a correct response, the subject was
   given feedback.

3. Rest periods were eliminated from the battery. This was possible
   because virtually every test was shortened.

4. The computer programs controlling test administration were merged
   into one super-program, eliminating the time required to load the
   programs between tests.

5. The format and parameters used in the software containing test times
   were reworded, so that the software was more "self-documented."

6. The total time allowed for subjects to respond to a test item (in
   other words, response time limit) was set at 9.0 seconds for all
   reaction time tests (Simple and Choice Reaction Time, Short-Term
   Memory, Perceptual Speed and Accuracy, and Target Identification).
   In the PTB version, the response time limit had varied from test to
   test, for no particular reason.

7. Also, with regard to the reaction time tests, the software was
   changed so that the stimulus for an item disappears when the subject
   lifts his or her hand from the home button (to make a response).
   Subjects had been instructed not to lift their hands from the home
   button until they had determined the correct response, so that
   separate measures of decision and movement time could be obtained.
   However, more than a few of the field test subjects continued to
   study the item stimulus after leaving the home buttons. By causing
   the item to disappear, we hope to eliminate that problem.


Changes  to  Specific  Tests 

Changes to the individual computer-administered tests are summarized in
Table 11.53.

No changes were recommended for Simple Reaction Time. However, the
order of the pretrial intervals (the interval between the time the subject
depresses the home buttons and the time the trial stimulus appears) was
randomized.

Fifteen items were added to Choice Reaction Time in an attempt to
increase the test-retest reliability for mean reaction time on this test.

Twelve items were eliminated from the Perceptual Speed and Accuracy Test
(reduced from 48 to 36 items), primarily to save time. Internal consistency
estimates were high for scores on this test (.83, .96, .88, and .74 for
percent correct, mean reaction time, slope, and intercept, respectively), and
reduction in the number of items did not seem to be cause for concern.
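
One way to gauge the cost of shortening a test is the Spearman-Brown formula, which projects reliability when test length changes. The report does not say that this formula was used for the decision; it is applied here only as an illustration, and it is a rough guide at best for derived scores such as slope and intercept.

```python
def spearman_brown(reliability, length_ratio):
    """Projected reliability when test length is multiplied by length_ratio."""
    k = length_ratio
    return k * reliability / (1 + (k - 1) * reliability)

# Internal consistency estimates quoted above, projected from 48 to 36 items.
for label, r in [("percent correct", 0.83), ("mean reaction time", 0.96),
                 ("slope", 0.88), ("intercept", 0.74)]:
    print(label, round(spearman_brown(r, 36 / 48), 2))
```

Under this projection a highly reliable score loses little when a quarter of the items are removed, which is consistent with the judgment stated above.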

Several changes were made to the Target Identification Test. First, one
of the two item types, the "moving" items, was eliminated. Field test data
showed that scores on the "moving" and stationary items correlated .78, and
the moving items had lower test-retest reliabilities than stationary items
(.54 vs. .74). All target objects were made the same size (50% of the
objects depicted as possible answers) since field test analyses indicated
size had no appreciable effect on reaction time. A third level of angular
rotation was added so that the target objects were rotated either 0°, 45°, or
75°. Theoretically, and as found in past research, reaction time is expected
to increase with greater angular rotation. Two of the item parameters were
not changed (position of correct response object and direction of target
object). Finally, the number of items was reduced from 48 to 36 to save
time. Internal consistency and test-retest estimates indicated that the
level of risk attached to this reduction was acceptable. (For mean reaction
time, the internal consistency estimate was .96 and the test-retest estimate
was .67.) The test, as modified, then had two items in each of 18 cells
determined by crossing angular rotation (0°, 45°, 75°), position of correct
response object (left, center, or middle of screen), and direction of target
object (left-facing or right-facing).
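
The item layout described above can be checked with a short Python sketch that crosses the three parameters. The position labels are placeholders chosen for illustration, since only the number of positions matters for the count.

```python
from itertools import product

ROTATIONS = (0, 45, 75)                      # degrees of angular rotation
POSITIONS = ("pos_1", "pos_2", "pos_3")      # placeholder labels for the three screen positions
DIRECTIONS = ("left-facing", "right-facing") # direction of target object
ITEMS_PER_CELL = 2

# Fully crossed design: 3 rotations x 3 positions x 2 directions.
cells = list(product(ROTATIONS, POSITIONS, DIRECTIONS))
# Two items are written for every cell.
items = [cell for cell in cells for _ in range(ITEMS_PER_CELL)]
print(len(cells), len(items))
```

Crossing the parameters fully, with an equal number of items per cell, lets the effect of each parameter on reaction time be estimated separately.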

One item parameter (probe delay period) was eliminated from the
Short-Term Memory Test, while two others were retained: item type (symbolic
vs. letter) and item length (one, two, or five objects). Analyses of field
test data showed that probe delay period did not significantly affect mean
reaction time scores. To save time, 12 items were eliminated. Two of the
three most important scores for this test appeared to have high enough
reliabilities to withstand such a reduction.

The number of items on the Cannon Shoot Test was reduced from 48 to 36,
again to save time. Internal consistency and test-retest reliabilities for
the time error score were high enough (.88 and .66, respectively) to warrant
such a reduction without the expectation of a significant impact on
reliability. Also, the items were modified so that all targets are visible
on the screen at the beginning of the trial and so that the subject is given
at least 2 seconds to view the speed and direction of the target before the
target reaches the optimal fire point.


Table 11.53

Summary of Changes to Computer-Administered Measures in the
Pilot Trial Battery

Test Name                      Changes

COGNITIVE/PERCEPTUAL TESTS

Demographics                   Eliminate race, age, and typing experience
                               items. Retain SSN and video experience
                               items.

Simple Reaction Time           No changes.

Choice Reaction Time           Increase number of items from 15 to 30.

Perceptual Speed & Accuracy    Reduce items from 48 to 36. Eliminate word
                               items.

Target Identification          Reduce items from 48 to 36. Eliminate
                               moving items. Allow stimuli to appear at
                               more angles of rotation.

Short-Term Memory              Reduce items from 48 to 36. Establish a
                               single item presentation and probe delay
                               period.

Cannon Shoot                   Reduce items from 48 to 36.

Number Memory                  Reduce items from 27 to 18. Shorten item
                               strings. Eliminate item part delay periods.

PSYCHOMOTOR TESTS

Target Tracking 1              Reduce items from 27 to 18. Increase item
                               difficulty.

Target Tracking 2              Reduce items from 27 to 18. Increase item
                               difficulty.

Target Shoot                   Reduce items from 40 to 30 by eliminating
                               the extremely easy and extremely difficult
                               items.

Two modifications were made to Number Memory to reduce test
administration time. The item part delay period was made a constant (1
second) rather than treated as a parameter with two levels (0.5 and 2.5
seconds), and the item string length (number of parts in an item) was changed
from four, six, or eight parts to two, three, or four parts. These changes
drastically reduced the time required to complete the test. As a result, the
reduction in the number of items that had been recommended was not
necessary. The Trial Battery version of this test had 28 items, constructed
so that there were 13 replications of the four arithmetic operations (add,
subtract, multiply, and divide).

Similar kinds of changes were made to Target Tracking Test 1 and Target
Tracking Test 2. Since internal consistency and test-retest reliability
estimates were relatively high, the number of items was reduced from 27 to
18. The difficulty of the test items was increased by increasing the speed
of the crosshairs and the target; this was done because field test data
indicated that the mean distance score was positively skewed. Also, the
ratio of target to crosshairs speed, rather than target speed, was used as a
test parameter. It seemed that, given a particular crosshairs speed, the
ratio would be a better indicator of item difficulty than the actual target
speed.

Several changes were made to the Target Shoot Test. First, all test
items were classified according to three parameters: crosshairs speed, ratio
of target to crosshairs speed, and item complexity (i.e., number of
turns/mean segment length). Then, items were revised to achieve a balanced
number of items in each cell when the levels of these parameters were
crossed. This had the result of "un-confounding" these parameters so that
analyses could be made to see which parameters contributed to item
difficulty. Second, extremely difficult items were eliminated and item
presentation times (the time the target was visible on the screen) were
increased to a minimum of 6 seconds (and a maximum of 10 seconds). This was
done to eliminate a severe missing data problem for such items (as much as
40%) discovered during field tests. Missing data occurred when subjects
failed to "fire" at a target. These "no-fires" were found to occur where the
target moved very rapidly or made many sudden changes in direction and speed,
or the item lasted only a few seconds. The number of items was reduced from
40 to 30 to save testing time, primarily by eliminating the extremely easy
items.


Changes  to  Non-Cognitive  Measures  (ABLE  and  AVOICE) 

Changes to the non-cognitive measures (ABLE and AVOICE) are summarized
in Table 11.54.


Table 11.54

Summary of Changes to Pilot Trial Battery Versions of Assessment of
Background and Life Experiences (ABLE) and Army Vocational Interest
Career Examination (AVOICE)*

Inventory/Scale Name                   Changes

ABLE-Total                             Decrease from 270 to approximately
                                       209 items.

AVOICE-Total                           Decrease from 309 to approximately
                                       214 items.

AVOICE Expressed Interest Scales       Drop.

AVOICE Single Item Holland Scales      Drop.

AVOICE Agriculture Scale               Drop.

Work Environment Preference Scales     Move to criterion measure booklet
                                       (delete from AVOICE booklet).

* In addition to the changes outlined in this table by inventory scale, it
was recommended that all ABLE item response options be standardized as
three-option responses and all AVOICE item response options be standardized
as five-option responses.


Time constraints required a 26% reduction in the total number of ABLE
and AVOICE items. The goal was to decrease items on a scale-by-scale basis,
while preserving the basic content of each scale. The strategy adopted to
accomplish this for each scale was to:

1. Sort items into content categories.

2. Rank order items within category, based on item-scale correlations.

3. Drop items in each category with the lowest item-scale correlations
until the desired number of items for that scale had been deleted.
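
The three steps above can be sketched in Python as follows. The report does not specify how deletions were spread across categories, so the rule used here (always thin the category that currently has the most items) is an assumption, as are the function name and the example items.

```python
from collections import defaultdict

def reduce_scale(items, n_to_drop):
    """Drop n_to_drop items from one scale.

    items: list of (item_id, content_category, item_scale_correlation).
    Assumed rule: repeatedly drop the lowest-correlation item from the
    category that currently holds the most items, so that no category is
    emptied before the others are thinned.
    """
    by_category = defaultdict(list)
    for item in items:                        # step 1: sort into categories
        by_category[item[1]].append(item)
    for members in by_category.values():      # step 2: rank within category
        members.sort(key=lambda it: it[2], reverse=True)
    for _ in range(n_to_drop):                # step 3: drop weakest items
        candidates = [m for m in by_category.values() if m]
        if not candidates:
            break
        # Largest category first; ties go to the weaker last-ranked item.
        largest = max(candidates, key=lambda m: (len(m), -m[-1][2]))
        largest.pop()
    return [it for members in by_category.values() for it in members]

# Hypothetical items for one scale (not data from the report).
example = [("a1", "work", 0.50), ("a2", "work", 0.20), ("a3", "work", 0.35),
           ("b1", "school", 0.40), ("b2", "school", 0.10)]
kept = reduce_scale(example, 2)
print([item_id for item_id, _, _ in kept])
```

Dropping within categories rather than across the whole scale is what preserves content coverage: a purely correlation-based cut could remove every item from a weakly correlated but substantively important category.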

The total number of AVOICE items was decreased from 309 to 214.
Thirty-eight of these 214 are items on the Work Environment Preference
scales. It was decided to take this whole section out of the AVOICE booklet
and include it in one of the criterion measure booklets, where a bit more
administration time was available.

A decision was also made to delete the Agriculture scale, the six
single-item Holland scales, and the eight Expressed Interest items.
Reductions made on the remaining AVOICE scales were accomplished using the
same strategy as that for the ABLE, decreasing scale lengths while preserving
heterogeneity.


Description of the Trial Battery and Summary Comments


Table 11.55 shows the final array of tests for the Trial Battery. These
are the measures that were the product of the revisions just described.

The Trial Battery was designed to be administered in a period of 4 hours
and was used during the Concurrent Validation phase of Project A, in which
data collection began in FY85. Data collected in that phase will allow the
first look at the validity of Trial Battery measures against job performance
criteria. It will also allow replication, on a much larger sample, of many
of the analyses described in the development and testing sections presented
in this report.


Table 11.55

Description of Measures in the Trial Battery

                                          Number of    Time Limit
COGNITIVE PAPER-AND-PENCIL TESTS          Items        (minutes)

Reasoning Test                            30           12
Object Rotation Test                      90           7.5
Orientation Test                          24           10
Maze Test                                 24           5.5
Map Test                                  20           12
Assembling Objects Test                   32           16

                                          Number of    Approximate
COMPUTER-ADMINISTERED TESTS               Items        Time

Demographics                              2            4
Reaction Time 1                           15           2
Reaction Time 2                           30           3
Memory Test                               36           7
Target Tracking Test 1                    18           8
Perceptual Speed and Accuracy Test        36           6
Target Tracking Test 2                    18           7
Number Memory Test                        28           10
Cannon Shoot Test                         36           7
Target Identification Test                36           4
Target Shoot Test                         30           5

NON-COGNITIVE PAPER-AND-PENCIL            Number of    Approximate
INVENTORIES                               Items        Time

Assessment of Background and Life         209          35
Experiences (ABLE)
Army Vocational Interest Career           176          20
Examination (AVOICE)
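
As a rough check that the battery fits the 4-hour slot, the times in Table 11.55 can be totaled. This Python sketch assumes that the "approximate time" entries are in minutes, like the paper-and-pencil time limits; the remaining time would presumably cover instructions and transitions, which the table does not itemize.

```python
# Times per measure, copied from Table 11.55 (assumed to be minutes).
paper_and_pencil = [12, 7.5, 10, 5.5, 12, 16]
computer = [4, 2, 3, 7, 8, 6, 7, 10, 7, 4, 5]
inventories = [35, 20]

testing_minutes = sum(paper_and_pencil) + sum(computer) + sum(inventories)
slack_minutes = 4 * 60 - testing_minutes
print(testing_minutes, slack_minutes)
```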

CRITERION  DEVELOPMENT 


The sections included in Part III describe the development
work for each major criterion measure, the revisions
made on the basis of pilot data, the procedure used for the
major criterion field tests, the results of the field tests,
and the final revisions made on the basis of those results
for use in Concurrent Validation.


Section  1 


INTRODUCTION  TO  CRITERION  DEVELOPMENT 


The overall goals of training and job performance (i.e., criterion)
measurement in Project A are to define the total domain of performance in
some reasonable way and then develop reliable and valid measures of each
major factor. The specific measures are to be used as criteria against which
to validate selection and classification tests and are not intended to serve
as operational performance appraisal methods. That is, the research
participants will be informed that these performance measures are not part of
their formal performance appraisal and will not be entered into their
personnel file. However, this does not mean that the Project A measures
cannot be modified to serve as useful operational performance appraisals in
future contexts; we certainly hope that they can be.

The general procedure for criterion development in Project A followed a
basic cycle of a comprehensive literature review, conceptual development,
scale construction, pilot testing, scale revision, field testing, and
proponent (management) review. The specific measurement goals are to:

• Make a state-of-the-art attempt to develop job sample or "hands-on"
measures of job task proficiency.

• Compare hands-on measurement to paper-and-pencil tests and rating
measures of proficiency on the same tasks (i.e., a multitrait,
multimethod approach).

• Develop rating scale measures of performance factors that are common
to all first-tour enlisted MOS (Army-wide measures).

• Develop standardized measures of training achievement for the purpose
of determining the relationship between training performance and job
performance.

Given these intentions, the criterion development effort focused on
three major methods: hands-on job sample tests, multiple-choice knowledge
tests, and ratings. The behaviorally anchored rating scale (BARS) procedure
was extensively used in developing the rating methods.


Modeling  Performance 


The development efforts to be described were guided by a particular
"theory" of performance. The intent was to proceed through an almost
continual process of data collection, expert review, and model/theory
revision. Two iterations of this procedure are described in this report.
The basic outline of the initial model is described in the following
paragraphs.
A first basic point that should generate no argument is that job
performance is multidimensional. There is not one attribute, one outcome,
one factor, or one anything that can be pointed to and labeled as job
performance. It is perhaps a bit more arguable to go on from there and
assert that job performance is a construct (which implies a "theory" of
performance), and is manifested by a wide variety of behaviors, or things
people do, that are judged to be important for accomplishing the goals of the
organization. For example, a manager could make contributions to
organizational goals by working out congruent short-term goals for his
subordinates, and thereby guiding them in the right direction, or by praising
them for a job well done, and thereby increasing subsequent effort levels.
Each of these activities probably requires different knowledges and skills,
which are in turn most likely a function of different abilities.

Consequently, for any particular job, one fundamental task of
performance measurement is to describe the basic factors that comprise
performance. That is, how many such factors are there and what is their basic
nature?


Two  General  Factors 

For the population of entry-level enlisted positions in the Army, we
postulated that there are two major types of job performance factors. The
first is composed of performance components that are specific to a particular
job. That is, measures of such components would reflect specific technical
competence or specific job behaviors that are not required for other jobs.
For example, typing correspondence would be a performance component for an
administrative clerk (MOS 71L) but not for a tank crewman (MOS 19E). We have
called such components "MOS-specific" criterion factors.

The second type of performance includes components that are defined and
measured in the same way for every job. These have been referred to as
"Army-wide" criterion factors. Examples might be performance on the common
tasks for which every soldier is responsible or proficiency in peer
leadership.

For the MOS-specific components we anticipated that there would be a
relatively small number of distinguishable subfactors (or constructs) of
technical performance that would be a function of different abilities or
skills and that would be reflected by different task content. The criterion
construction procedures were designed to identify technical performance
factors that reflected different task content.

The Army-wide concept incorporates the basic notion that total
performance is much more than task or technical proficiency. It might include such
things as contribution to teamwork, continual self-development, support for
the norms and customs of the organization, and perseverance in the face of
adversity. A much more detailed description of the initial working model for
the Army-wide segment of performance can be found in Borman, Motowidlo, Rose,
and Hanser (1987a).

In sum, the working model of total performance with which the project
began viewed performance as multidimensional within the two broad categories
of factors or constructs. The job analysis and criterion construction
methods were designed to "discover" the content of these factors via an
exhaustive description of the total performance domain, several iterations of
data collection, and the use of multiple methods for identifying basic
performance factors.


Factors  Versus  a  Composite 

Saying that performance is multidimensional does not preclude using just
one index of an individual's contributions to make a specific personnel
decision (e.g., select/not select, promote/not promote). As argued by
Schmidt and Kaplan (1971) some years ago, it seems quite reasonable for the
organization to scale the importance of each major performance factor
relative to a particular personnel decision that must be made, and to combine
the weighted factor scores into a composite that represents the total
contribution or utility of an individual's performance, within the context of that
decision. That is, the way in which performance information is weighted is a
value judgment on the organization's part. The determination of the specific
combinational rules (e.g., simple sum, weighted sum, non-linear combination)
that best reflect what the organization is trying to accomplish is in large
measure a research question.
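The weighted-sum rule mentioned above can be illustrated with a minimal sketch. The factor names, scores, and weights below are hypothetical and are not taken from Project A; the sketch only shows how decision-specific weights turn several factor scores into one composite.

```python
from typing import Mapping


def composite_score(factor_scores: Mapping[str, float],
                    weights: Mapping[str, float]) -> float:
    """Weighted sum of performance factor scores for one personnel decision."""
    missing = set(weights) - set(factor_scores)
    if missing:
        raise KeyError(f"no score for factors: {sorted(missing)}")
    return sum(weights[name] * factor_scores[name] for name in weights)


# Hypothetical standardized factor scores for two soldiers.
soldier_a = {"technical": 1.0, "teamwork": -0.5, "discipline": 0.0}
soldier_b = {"technical": -0.5, "teamwork": 1.0, "discipline": 0.0}

# Hypothetical importance weights for one decision (e.g., promotion).
promotion_weights = {"technical": 0.4, "teamwork": 0.4, "discipline": 0.2}

score_a = composite_score(soldier_a, promotion_weights)
score_b = composite_score(soldier_b, promotion_weights)
```

With these illustrative weights the two soldiers, who have opposite profiles on the first two factors, receive the same composite. A different decision would use a different weight vector over the same factor scores.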

In sum, it makes sense to assert that performance in a particular job is
made up of several relatively independent components and then ask how each
component relates to some continuum of overall utility. It is quite possible
for people with different strengths and weaknesses on the performance factors
to have very similar overall utility for the organization.


A  Structural  Model 

If performance is characterized in the above manner, then a more formal
way to model performance is to think in terms of its latent structure. The
usual common factor model of the latent structure is open to criticism
because all of the criterion (i.e., performance) measures may not be at the
same level of explanation or they may be so qualitatively different that
putting them into the same correlation matrix does not seem appropriate. For
example, combining the dichotomous variable stay vs. leave (voluntarily) with
a hands-on job performance test score seems like a strange thing to do.
Also, two criteria may not be functionally independent. One might be a cause
of another; for example, individual differences in training performance may
be a cause of individual differences in job performance.

Considerations such as the above have led some people to propose
structural equation modeling as a way to portray the criterion space and the
associated predictor space more meaningfully (e.g., Bentler, 1980; James,
Mulaik, & Brett, 1982).

From this perspective, the aims of criterion analysis are to use all
available evidence, theory, and professional judgment to (a) identify the
variables that are necessary and sufficient to explain the phenomena of
interest, and (b) specify the nature of the relationships between pairs of
variables in terms of whether they (1) are correlated because one is a cause
of another, (2) are correlated because both are manifestations of the same
latent property, or (3) are independent. The more explicitly the causal
directions and the predicted magnitude of the associations can be specified,
the greater the potential power of the model. That is, it more clearly
outlines the kinds of data to be collected and the kinds of analyses to be done,
and it provides a much more explicit framework within which to interpret
empirical results.

Within the structural equation framework there are two general kinds of
models, one dealing with manifest variables (operational measures) and one
with latent variables (constructs). The most thorough portrayal of a domain
presumably involves both; certainly in Project A we have assumed that it
does. The proposal and research plans have dealt explicitly with criterion
constructs and criterion measures. What we really want to model, in terms of
identifying the necessary and sufficient variables and their causal
interrelationships, are the more "fundamental" underlying constructs. What we in
fact will have are operational measures that represent the constructs.

A few points, some general and some specific, should be made about this
view. First, it is true that we simply know a lot more about predictor
constructs than we do about job performance constructs. There are volumes of
research on the former, and almost none on the latter. For personnel
psychologists it is almost second nature to talk about predictors in terms of
constructs. However, investigation of job performance constructs seems to
have been limited to those few studies dealing with synthetic validity and
those using the critical incidents format to develop performance factors.
Relative to the latter, the occupations receiving the most attention have
been managers, nurses, firefighters, police officers, and perhaps college
professors (cf. Landy & Farr, 1983). Relatively little attention has been
given to conceptualizing performance in clerical, technical, or skilled jobs.

Second, the usual textbook illustration of a latent structural equation
model (e.g., James et al., 1982) shows each latent variable being represented
by one or more manifest operational measures. However, in our situation,
just as it is easy to think of examples where a predictor test score could be
a function of more than one latent variable (e.g., the score on a computerized
two-hand tracking apparatus could be a function of several latent psychomotor
"factors"), the same will be true of criterion measures. Most of them will
not be unidimensional.
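In conventional structural equation notation, this second point can be written as a measurement model in which one manifest measure loads on several latent variables. The symbols below follow common textbook usage and are assumptions for illustration; they do not appear in the report.

```latex
% y_i          : observed (manifest) criterion measure i
% \eta_k       : latent performance construct k, k = 1, ..., K
% \lambda_{ik} : loading of measure i on construct k
% \epsilon_i   : unique and error component of measure i
y_i = \sum_{k=1}^{K} \lambda_{ik}\,\eta_k + \epsilon_i
% A unidimensional measure is the special case in which
% \lambda_{ik} \neq 0 for exactly one k.
```

Under this sketch, a criterion measure that is not unidimensional is simply one with more than one nonzero loading.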

Third, we would be hard-pressed to defend placing the criterion
variables on some continuum from immediate to intermediate to ultimate as a means
for portraying their relative importance or functional interrelationships.
For example, although there are good reasons for developing hands-on
performance measures, we would not be willing to defend hands-on performance scores
as the "most ultimate" measure.


Unit vs. Individual Performance

Finally, people do not usually work alone. Individuals are members of
work groups or units and it is the unit's performance that frequently is the
most central concern of the organization. However, determining the
individual's contribution to the unit's score is not a simple problem. Further,
variation in unit performance is most likely a function of a number of
factors besides the "true" level of performance of each individual. The
quality of leadership, weather conditions, or the availability of spare parts
are examples of such additional sources of variation in unit performance. In
addition, there probably are, somewhere, interactions between the
characteristics of individuals and the characteristics of units or situations.

For two major reasons, Project A has not incorporated unit effectiveness
in its model of performance. First, the project is focused on the
development of a new selection/classification system for entry-level personnel and
is concerned with improving personnel decisions about individuals, and not
units. The task is to maximize the average payoff per individual selected.
The Army cannot make differential assignments based on differences in
weather, leadership climate, and so on. Future conditions cannot be
predicted with any certainty and during a tour of duty an individual will
serve in several different units. Consequently, personnel assignments must
be optimal when averaged across all such conditions. By design, they should
not take situational interactions into account. Operationally, these sources
of variation must be dealt with by other means (e.g., leadership training).
However, in a research context, Project A is attempting to investigate these
additional sources of variation via a systematic description of the work
environment. The Army Work Environment Questionnaire (Olsen, Borman,
Robertson, & Rose, 1984) asks job incumbents to describe 14 dimensions of
their job situation using a 44-item questionnaire.

The second major reason for not using unit performance as a criterion is
the prohibitive cost. It simply was not possible to develop reliable and
valid field exercises for assessing unit performance in a representative
sample of MOS within a reasonable time frame. In isolated instances it might
be possible to take advantage of regularly scheduled exercises or use
existing performance records that a particular unit (e.g., maintenance depot)
might keep. However, it proved not possible to obtain such data in any
systematic way. Even if it could be done, it would not be easy to establish
the correspondence between individual performance and unit effectiveness.

What we have chosen to do is to try to identify the factors, or means,
by which individuals contribute to unit performance and to assess individual
performance on those factors via rating methods. At the same time we are
attempting to determine how much of the variance in individual performance is
accounted for by the situational characteristics assessed by the Army Work
Environment Questionnaire.

Plan  for  Part  III 

With the above discussion as background, we now turn to describing the
development steps for each of the major performance measures. Once the
initial array of criteria has been described, the procedures and results of
the full-scale criterion field tests will be summarized. Finally, the
revisions of the performance measures that were made on the basis of the field
test results and the reviews by the Army proponents will be outlined.

At that point the final array of performance measures to be used in the
Concurrent Validation will have been established. Again, it is the intent of
the project that, within the limits of its money and time, this array will
describe the entire performance space and utilize all feasible means of
assessing performance.


Section  2 


DEVELOPMENT OF MEASURES OF TRAINING SUCCESS¹


The purpose of this section is to describe the development of the
achievement tests used to measure training success for the 19 MOS in the
Project A sample. The tests were developed in three batches (A, B, and Z).
Batch A and Batch B are defined as described in Part I and Batch Z is
composed of the remaining 10 MOS. The complete lineup is shown in Table III.1.
The methods used to develop materials for the three batches were essentially
the same except for modification, as described below, to take advantage of
experience with prior batches. All measures were pilot tested with a sample
of 50 soldiers at the conclusion of Advanced Individual Training (AIT).
Batches A and B were also field tested with job incumbents; the tests in
Batch Z were not.


The  Measurement  Model 


The  Construct  of  Training  Success 

The Project A Research Plan defines training success in terms of the
individual trainee's achievement. The original Statement of Work used the
term in a similar way, that is, to refer to specific training measures taken
on soldiers in the course of training, such as those included in the TRADOC
Educational Data System (TREDS) and the Automated Instruction Management
System (AIMS). Many of these measures include both hands-on and cognitive
instruments.

Training success encompasses both the outcomes of formal training and
organizational socialization. Organizational socialization is defined as the
way in which soldiers accommodate to their role as soldiers and "learn the
ropes," such as the attitudes, standards, and patterns of behavior expected
of soldiers in general and of soldiers in an assigned MOS. Organizational
socialization is achieved through formal training, of course, but it also
is developed through a variety of tactics outside of the regular classroom,
including role modeling, drill, stressful experiences, NCO leadership, and
other practices designed to produce appropriate military attitudes.

A wide variety of potentially useful measures either are available or
could be created to assess three major elements of training success: (a) the
knowledge component, (b) the hands-on, or performance, component, and (c) the
organizational socialization component. The achievement tests, described
below, are designed to measure the knowledge component of formal training
experience, specifically AIT. This component of training success includes
two types of knowledge: (a) knowledge about the job, as taught in AIT, and
(b) knowledge about a wide range of "common skills" that cut across all MOS
and that all soldiers are expected to know.


¹This section is based primarily on ARI Technical Report 757, Development
and Field Test of Job-Relevant Knowledge Tests for Selected MOS, by
Robert H. Davis, Gregory A. Davis, John N. Joyner, and Maria Veronica
de Vera, and a supplementary ARI Research Note, in preparation, which
contains the report appendixes.


Table III.1

Military Occupational Specialties (MOS) Included in Batches A, B, and Z


Batch A

13B  Cannon Crewman
64C  Motor Transport Operator
71L  Administrative Specialist
95B  Military Police

Batch B

11B  Infantryman
19E  Armor Crewman
31C  Radio Teletype Operator
63B  Light Wheel Vehicle Mechanic
91A  Medical Specialist

Batch Zᵃ

12B  Combat Engineer
16S  MANPADS Crewman
27E  TOW/Dragon Repairer
51B  Carpentry/Masonry Specialist
54E  NBC Specialist
55B  Ammunition Specialist
67N  Utility Helicopter Repairer
76W  Petroleum Supply Specialist
76Y  Unit Supply Specialist
94B  Food Service Specialist
19K  M1 Abrams Armor Crewmanᵇ


ᵃNot field tested with job incumbents.

ᵇDeveloped for longitudinal validation; not included in the Concurrent
Validation.


Relationship  Between  Training  Content  and  Job  Content 


Within the military, there is a very close relationship between training
content and tasks performed on the job. Skill Level 1 soldiers within any
given MOS may work at quite different jobs (that is, jobs that emphasize
different skills), but it is almost always the case that the knowledges and
skills necessary for the performance of a job at Skill Level 1 are taught in
AIT. As a matter of doctrine training must be job-related, and in developing
training objectives and materials Army personnel make every effort to ensure
that they are job-related. As a result, if a content-valid test is created
based on curricular materials alone, one can assume that most of the items
will be job-related. While school curricula do sometimes include topics or
tasks that are unrelated to the job, this is the exception rather than the
rule.

Classes  of  Items 

It seems clear that some trainees learn important job skills that are
not taught in the schools. As a result of extracurricular activities,
outside study, generalization, or all three, a trainee may develop some job
skills in the school setting that are not taught as part of the curriculum.
From the perspective of criterion development, one might hypothesize that the
exceptional (that is, most successful) trainee is one who goes beyond the
formal curriculum and learns such knowledge and skills. In education, the
jargon term for this phenomenon is incidental learning.

Similarly, military training performance is predictive of later military
job performance because (a) training performance reflects general learning
ability (and hence identifies who will acquire knowledge on the job), (b) the
information acquired in training is itself a significant factor in job
performance, or, more likely, (c) both. Accordingly, we constructed two
subsets of test items, one reflecting training content and the other job
content. Where a sufficient number of test items could be developed for both
classes, scores on the two types of items may shed light on the relationships
between success in training and success on the job. That is, is the
correlation between training performance and job performance a function of
achievement during training, incidental learning during training, or
individual differences in basic abilities that are present before training
starts?

The  Meaning  of  Content  Validity 

Although  definitions  of  content  validity  differ,  the  literature  stresses 
three  critical  components:  clarity  of  the  content  domain,  representativeness 
of  content,  and  relevance  of  content. 

By domain clarity, we mean that the content domain should be defined
unambiguously. Essentially, this means that the boundaries of the domain
from which test content is drawn should be clearly defined and understood.
Experts may differ as to the appropriate boundaries, and the differences may
become matters of disagreement in the course of test construction. But once
the boundaries are defined, experts should be able to agree as to whether
items fall inside or outside of those boundaries. At the outset, we
operationally defined the content domain in the following way. For training
content, the domain was described by Programs of Instruction (POIs), lesson
plans, technical publications, Soldier's Manuals, and the Common Task
Manual. For the job, content was specified by Army Occupational Survey
Program (AOSP) reports, technical publications, Soldier's Manuals, and the
Common Task Manual.


The issue of content representativeness refers to the question of
whether the domain has been adequately sampled. Operationally, establishing
content representativeness involves a strategy for arriving at item budgets,
that is, budgets for the number of items on a test to cover different parts
of the domain. When people disagree about such matters, the question is
normally resolved in terms of the level of expertise of those making the
decision. In the case of the Project A achievement tests, the procedure for
developing item budgets as representative samples of training and job content
was determined by test construction experts (i.e., project staff) but the
content of the budget was evaluated by subject matter experts (SMEs: job
incumbents and trainers).

It is the SME evaluations that provide the judgments of content
relevance. In the narrowest sense, we may simply ask whether or not specific
items are relevant to the two facets of the content domain already
identified, that is, training achievement and job performance. The question may
also be extended to explore the relevance of items under different
circumstances or scenarios (e.g., peacetime, readiness, and combat). How this was
accomplished is described later in this section.


Test  Development  Procedure 

The principal steps in the construction of the training achievement
tests were as follows:

1.  Development of the initial item pool

2.  Review by job incumbents

3.  Review by school trainers

4.  Administration to trainees

5.  Preparation of the item pools for administration to job incumbents

6.  Administration to job incumbents (Field Tests, Batches A and B only)

7.  Review by TRADOC proponent agencies

8.  Preparation of the item pools for administration to job incumbents
in the Concurrent Validation.

Each of these steps is described briefly below. Although each test went
through many revisions during this process, it is perhaps easiest to think in
terms of the three versions shown in Figure III.1: (a) the initial item
pool, (b) the version administered to incumbents in the field test, and (c)
the version administered to incumbents in the Concurrent Validation. Figure
III.1 also summarizes the differences in developmental procedures between
Batches A/B and Batch Z.

Procedures were modified somewhat on the basis of experience as the
tests were developed. For example, all item pools were reviewed by groups of
SMEs. However, after the first few group reviews, it was apparent that a
preliminary review by one SME for accuracy, correct use of technical
language, currency, and appropriateness could greatly facilitate the group
review. Accordingly, this step was introduced in the process. Similarly, as
reviews progressed, a concern for racial and gender balance within SME groups
led to the development and implementation of guidelines for taking racial and
gender aspects into account in assigning SMEs to review groups. A second
informal review was scheduled for all items that had been reviewed before the
implementation of the guidelines.


Development  of  the  Initial  Item  Pool 

Development of the item pool for each MOS involved three elements:
refinement of the AOSP task list, calculation of a test item budget, and item
drafting itself.

Refinement of the AOSP Task List. The Army Occupational Survey Program
uses a questionnaire checklist of several hundred items to survey job
incumbents about specific job tasks that they do or do not perform. Related
tasks are combined into duty areas on the basis of expert judgment by the job
proponent. The number of duty areas in each of the 19 MOS included in the
present study ranged from 15 to 23. One of the key statistics reported as
part of the AOSP is the percentage of soldiers at different skill levels who
are performing the task activity. As described below, this statistic was
used to prepare a test item budget prior to drafting items.

Before the AOSP reports were used, however, several actions were taken
to refine the item information. For Batches A and B, the AOSP task lists
were edited as follows:

•  Ninety-nine percent confidence intervals were computed for the mean
percentage performing all tasks. Tasks with a very low percent
performing (equal to or less than the lower boundary of the
confidence interval) were deleted from consideration.

•  The remaining task statements were then reviewed by four to six
SMEs (experienced NCOs in that MOS) to:

--  Delete AOSP statements for any of three reasons: They were no
longer part of the job due to changes in doctrine or equipment;
they were not really tasks, and should not have been included in
the AOSP listing (e.g., administrative labels that were
misconstrued as tasks); or they were sets of tasks (i.e., they
contained only individual tasks that were already in the domain).

--  Confirm the project staff's grouping of AOSP task statements
into the task specified in the Soldier's Manual.
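The confidence-interval screen in the first bullet can be sketched as follows. This is one reading of the procedure, under stated assumptions: the interval is taken around the mean of the per-task percentages, a normal approximation (z = 2.576) is used, and the task names and percentages are invented for illustration. The report does not give the exact formula.

```python
import math
import statistics

Z_99 = 2.576  # two-sided 99% critical value; normal approximation is an assumption


def screen_tasks(percent_performing: dict[str, float]) -> tuple[float, dict[str, float]]:
    """Drop tasks at or below the lower 99% bound for the mean percent performing."""
    values = list(percent_performing.values())
    mean = statistics.mean(values)
    std_err = statistics.stdev(values) / math.sqrt(len(values))
    lower_bound = mean - Z_99 * std_err
    kept = {task: pct for task, pct in percent_performing.items()
            if pct > lower_bound}
    return lower_bound, kept


# Hypothetical AOSP percentages of Skill Level 1 soldiers performing each task.
aosp_tasks = {
    "task_1": 60.0,
    "task_2": 55.0,
    "task_3": 50.0,
    "task_4": 45.0,
    "task_5": 5.0,
}

lower_bound, kept_tasks = screen_tasks(aosp_tasks)
```

In this invented example only `task_5` falls at or below the lower bound and is removed; the remaining statements would then go to the SME review described in the second bullet.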

Calculation of Item Budgets. To ensure that the content of the item pools
was representative of tasks performed and that it covered the entire MOS
rather than the aspects easiest to write items about, an item budget was drafted
based on the duty areas into which the AOSP survey is divided. As noted
above, the number of duty areas in the 19 AOSP surveys analyzed ranges from
15 to 23. It was expected that during tryout, revision, and field testing,
items would be eliminated from the pool because of faulty construction or
lack of discriminatory or predictive power. To allow for item attrition, the
initial target was 225 draft items for each MOS, even though the final
version of the test was expected to be closer to 150 items. Survey data on
percentage performing were used in building the budget as described below.

Step 1--Determine the match between AOSP duty areas and training
objectives. A matrix was prepared to display the duty areas of the AOSP
versus the subdivisions of the POI, each of which covers a number of training
objectives. Three outcomes were possible: (a) some duty areas matched Army
training lessons completely; (b) some duty areas did not match any training
lesson; (c) some training lessons did not match any duty area. The majority
of the item budget, 200 items, was allocated to the first two categories.

Step 2--Distribute the first 200 items. To determine a target number of
items for each duty area, the 200 items budgeted to the job performance
domain were distributed across the duty areas in proportion to the mean
percentage of incumbents reported by the AOSP as performing the tasks that
composed the duty area.
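A minimal sketch of this proportional distribution is shown below. The duty-area names and percentages are hypothetical, and the largest-remainder rounding used to force whole-number budgets that sum to 200 is an assumption; the report does not say how fractional items were handled.

```python
def allocate_items(mean_pct_by_area: dict[str, float],
                   total_items: int = 200) -> dict[str, int]:
    """Split total_items across duty areas in proportion to mean percent performing."""
    total_pct = sum(mean_pct_by_area.values())
    exact = {area: total_items * pct / total_pct
             for area, pct in mean_pct_by_area.items()}
    # Start from the whole-number part of each share.
    budget = {area: int(share) for area, share in exact.items()}
    # Hand out the leftover items to the largest fractional parts (assumed rule).
    leftover = total_items - sum(budget.values())
    by_fraction = sorted(exact, key=lambda area: exact[area] - budget[area],
                         reverse=True)
    for area in by_fraction[:leftover]:
        budget[area] += 1
    return budget


# Hypothetical mean percentages of incumbents performing tasks in each duty area.
duty_areas = {"vehicle maintenance": 50.0, "communications": 30.0, "first aid": 20.0}
item_budget = allocate_items(duty_areas)
```

For the invented percentages above, the three duty areas receive 100, 60, and 40 items respectively.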

Within each of the AOSP duty areas, items were budgeted in proportion to
how much they were emphasized in training: The greater the overlap between
the AOSP tasks (within a duty area) and the training objectives (within the
POI), the more items were written to represent job/training content.

The remaining items (out of the original 200) were assigned to job-only
content. For example, if 20 items were assigned to a duty area and the duty
area had a total of eight tasks, six of which matched objectives on the POI,
then 15 (6/8 x 20) training/job items and 5 job-only items (15 + 5 = 20)
would be written for that duty area.
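The worked example above can be checked with a short sketch. The function name is hypothetical, and the use of Python's built-in `round` for cases that do not divide evenly is an assumption, since the report gives only the one evenly divisible case.

```python
def split_duty_area_items(items_for_area: int,
                          total_tasks: int,
                          tasks_matching_poi: int) -> tuple[int, int]:
    """Return (training/job items, job-only items) for one duty area."""
    if total_tasks <= 0 or not 0 <= tasks_matching_poi <= total_tasks:
        raise ValueError("task counts are inconsistent")
    # Share of the area's items in proportion to tasks that match POI objectives.
    training_job = round(items_for_area * tasks_matching_poi / total_tasks)
    job_only = items_for_area - training_job
    return training_job, job_only


# The example from the text: 20 items, 8 tasks, 6 of which match POI objectives.
training_job_items, job_only_items = split_duty_area_items(20, 8, 6)
```

Because the job-only count is computed as the remainder, the two counts always sum to the duty area's budget.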

Step  3--Distribute  the  remaining  items  (25  or  fewer).  The  remainder  of 
the  item  budget  for  a  given  MOS  was  reserved  for  items  not  related  to  any 
area  of  the  AOSP  task  list,  but  covering  training  content  as  defined  by  the 
POI.  Within  the  portion  of  the  training  performance  domain  that  did  not 
match  any  portion  of  the  job  performance  domain,  the  allocation  of  test  items 
was  based  on  the  amount  of  training  time  devoted  to  particular  content. 

Drafting  of  Items.  After  item  budgets  were  established,  written 
materials  dealing  with  job  training  activities  were  examined  for  information 
that  could  be  transformed  into  multiple-choice  test  items.  Four  sources  were 
used: the AOSP task lists, training materials (POIs, lesson plans, lesson
guides, etc.), technical publications (Army regulations, Technical Manuals,
Field Manuals, etc.), and the Soldier's Manual for each MOS. The Soldier's
Manual  is  a  description  of  the  tasks  that  each  MOS  holder  is  to  have  mastered 
to  be  considered  qualified  at  a  given  skill  level. 

Using  these  various  documents  and  the  item  budgets,  multiple-choice 
items  were  drafted  for  all  MOS.  The  item-writing  group  included  the  research 
staff,  a  retired  Army  lieutenant  colonel,  and  other  contract  item-writers. 


Review  by  Job  Incumbents 

To prepare the item pool for review by a panel of job incumbents, the
pool was first reviewed by one subject matter expert, usually a senior
officer, who purged the item pool of its more glaring faults. The items were
then reviewed by job incumbents, which required a series of site visits. On
each visit, incumbents reviewed the items for technical accuracy and
appropriate vocabulary, and rated item content for importance and relevance
to Skill Level 1 soldiers.

The entire line-up of SMEs for the various review stages is shown in
Table III.2. Analysis indicated that minority groups were adequately
represented in the SME samples. For example, Table III.3 shows the expected
and observed frequencies of SMEs by race, compared to the percentage of
active duty soldiers in the Army in each racial category.

Table III.2

Number of Subject Matter Experts Participating in Training
Achievement Test Reviews, and Locations of Reviews


Batch  A 
13B 
64C 
71L 
95B 

Batch  B 
11B
19E 
31C 
63B 
91A 

Batch  Z 
12B 
16S 
27E
51B
54E
55B
67N
76W
76Y
94B 


Refinement of
Task List


Location 


Fort  Ord 
Fort  Ord 
Fort  Ord 
Fort  Ord 


Job Incumbent
Review


School  Trainer 
Review 


o 


Fort  Ord  5 
Fort  H.  Liggett  5 
Fort  Ord  5 
Fort  Ord  5 
Fort  Ord  5 


Fort  Ord 
Fort  Ord 
Fort  Ord 
Fort  Ord 
Fort  Ord 
Fort  Ord 
Fort  Ord 
Fort  Ord 
Fort  Ord 
Fort  Ord 


Location 

SME 

Location 

Fort 

Ord 

7 

Fort 

sin 

Fort 

Ord 

6 

Fort 

Dlx 

Fort 

Ord 

6 

Fort 

Jackson 

Fort 

sm/Dix 

10 

Fort 

McCl  el  lan 

Fort 

Ord 

6 

Fort 

Bennlng 

Fort 

H„  Liggett 

6 

Fort 

Knox 

Fort 

Ord 

6 

Fort 

Gordon 

Fort 

Ord 

6 

Fort 

Dlx 

Fort 

Ord 

6 

Fort 

Sam  Houston 

Fort 

Lewis 

6 

Fort 

L.  Vlood 

Fort 

Lewis 

6 

Fort 

B1  Iss 

Fort 

Lewis 

6 

Redstone  Arsenal 

Fort 

Lewis 

4 

Fort 

L.  Wood 

Fort 

Lewis 

5 

Fort 

McCl  ellan 

Fort 

Lewi  s 

5 

Redstone  Arsenal 

Fort 

Lewis 

6 

Fort 

Rucker 

Fort 

Ord 

6 

Fort 

Lee 

Fort 

Ord 

6 

Fort. 

Lee 

Fort 

5111/Dlx 

10 

Fort 

McClellan 

Fort 

Knox 

1C 

Fort 

Knox 
Table III.3

Distribution of Soldiers in Four Race Categories,
Army-Wide and Among Subject Matter Expert Reviewers
for Training Achievement Tests

             Army-Wide         Expected          Observed
             Percent           Frequency         Frequency
Race         Active Duty(a)    in SME Sample     in SME Sample

Caucasian       61.8             142.8              121
Black           30.5              70.4               74
Hispanic         4.0               9.2               33
Other            3.7               8.6                3

Total                                               231

(a) Source: Dr. Mark J. Eitelberg, personal communication.
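
The expected frequencies in Table III.3 follow from applying each Army-wide percentage to the total SME sample of 231. The short Python sketch below reproduces that calculation; the variable and function names are illustrative assumptions. The computed values agree with the published column to within about 0.1, presumably because the published percentages are themselves rounded.

```python
# Army-wide percentages and observed SME counts, taken from Table III.3.
ARMY_WIDE_PERCENT = {"Caucasian": 61.8, "Black": 30.5, "Hispanic": 4.0, "Other": 3.7}
OBSERVED = {"Caucasian": 121, "Black": 74, "Hispanic": 33, "Other": 3}
TABLE_EXPECTED = {"Caucasian": 142.8, "Black": 70.4, "Hispanic": 9.2, "Other": 8.6}


def expected_frequencies(percentages, sample_size):
    """Expected count per category if the sample mirrored the population."""
    return {race: pct / 100 * sample_size for race, pct in percentages.items()}


n_smes = sum(OBSERVED.values())  # total number of SME reviewers
expected = expected_frequencies(ARMY_WIDE_PERCENT, n_smes)

for race, value in expected.items():
    print(f"{race:<10} expected {value:6.1f}   observed {OBSERVED[race]:4d}")
```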


Item Quality. To establish the technical accuracy and appropriateness
of the draft items, job incumbents were asked:

•  Would the item be clear to someone taking the test?

•  Is the option indicated really the correct answer?

•  Is there more than one correct option?

•  Are the distractors realistic and believable?

•  Is each technical term commonly used and easily understood?

•  Are there other commonly used terms that should be included to make
the questions clearer?

Items were then revised in accordance with the responses from incumbents.

Importance Ratings. Incumbents were next asked to rate the importance
of each item in three different contexts: combat (Scenario 1), combat
readiness (Scenario 2), and garrison duty (Scenario 3). The scenarios used
to describe these three contexts are shown in Figure III.2. A 5-point scale
ranging from "Very Important" (5) to "Of Little Importance" (1) was used to
collect importance ratings.

Table III.4 shows the mean item importance under each of the three
scenarios. Also shown are the interrater reliabilities for the pooled
ratings. The "item pool" is defined as those items that were taken to
incumbents for the importance ratings--that is, the first version of the
test.


1)  Your unit is assigned to a U.S. Corps in Europe. Hostilities have broken
out and the Corps combat units are engaged. The Corps mission is to
defend, then reestablish, the host country's border. Pockets of enemy
airborne/heliborne and guerilla elements are operating throughout the
Corps sector area. The Corps maneuver terrain is rugged, hilly, and
wooded, and weather is expected to be wet and cold. Limited initial and
reactive chemical strikes have been employed but nuclear strikes have not
been initiated. Air parity does exist.

2)  Your unit is deployed to Europe as part of a U.S. Corps. The Corps
mission is to defend and maintain the host country's border during a
period of increasing international tension. Hostilities have not broken
out. The Corps maneuver terrain is rugged, hilly, and wooded, and
weather is expected to be wet and cold. The enemy approximates a
combined arms army and has nuclear and chemical capability. Air parity
does exist. Enemy adheres to same environmental and tactical constraints
as does U.S. Corps.

3)  Your unit is stationed on a post in the Continental United States. The
unit has personnel and equipment sufficient to make it mission capable
for training and evaluation and installation support missions. The
training cycle includes periodic field exercises, command and maintenance
inspections, ARTEP evaluations, and individual soldier training/SQT
testing. The unit participates in post installation responsibilities
such as guard duty and grounds maintenance and provides personnel for
ceremonies, burial details, and training support to other units.


Figure III.2

Alternative scenarios used for judging importance
of tasks and items for training achievement tests


Mean  Item  Importance  Ratings  by  Job  Incumbents  for  Three  Scenarios 
(Initial  Item  Pool  for  Training  Achievement  Tests) 
Two points about the ratings data are worth highlighting. First, a
relatively large percentage of the items were rated as very important.
Second, when importance ratings under the three scenarios are compared, a
lower percentage of items were rated as very important when using the combat
scenario than when using the garrison scenario (33.1 vs. 43.1%) and a higher
percentage of items were considered to be of little importance (22.8 vs.
11.2%). A 2 x 2 contingency table comparing item frequencies (Garrison &
Combat vs. Rating 1 & 5) yields a chi-square of 224.09, p < .004.

Mean interrater reliabilities were reasonably high for the combat and
combat readiness scenarios, .74 and .71 respectively, but somewhat lower for
the garrison scenario, .60.

Relevance Ratings. To establish the relevance of the draft test items,
incumbents were asked, "Do Skill Level 1 personnel in this MOS need to use
this knowledge on the job?" It was recognized that an MOS comprises many
jobs, or duty positions, and that incumbents in different duty positions
might disagree about item relevance because they defined the job
differently. The procedure followed was to favor inclusion. If any one
respondent in the group asserted that the knowledge was required for job
performance, then the item was flagged as job-relevant.

A by-product of the total review was the identification of tasks or duty
areas that were not included in the AOSP data but were part of incumbents'
responsibilities or that were included in the AOSP report but were no longer
part of the MOS. Some items were drafted for the former category after the
site visit. To maintain the relative distribution of items across duty
areas, additional items were also drafted to replace discarded items.


Review  by  School  Trainers 

The item pool was also reviewed by trainers at one of the training sites
for the MOS. As with the review by job incumbents, the trainers reviewed
items for technical accuracy and appropriate vocabulary, and rated item
content for importance and relevance. It was during such site visits that
pilot tests were conducted with trainees, as described in the next
subsection.

To obtain a measure of item importance from the trainers' point of view,
SMEs were given the following instructions:

Look at each of the test questions and ask yourself how
important it is that a trainee in the course learn the
knowledge represented by this question.

Trainers used the same scale as incumbents to rate item importance. Table
III.5 shows the mean rating for items in the item pool. The table also
contains interrater reliabilities for all MOS.

Overall, trainers tended to rate items significantly higher than did
incumbents. Mean importance rating by trainers for the pool was 4.18 (median
= 4.03) while the mean of the means across scenarios for incumbents on the
initial item pool was 3.52 (median = 3.58) (Wilcoxon test statistic = 3.38,
p = .001).


This same trend appears in the proportions of items rated very important and
of little importance. Trainers rated a mean of 54.4% of the items in the
item pool as very important, compared with incumbents who gave a rating of
very important to 33.1% of the items on the combat scenario and 43.1% of the
items on the garrison scenario. Incumbents rated 22.8% of the items as being
of little importance on the combat scenario and 11.2% on the garrison
scenario; trainers, however, rated only 4.1% of the items as of little
importance.


Table III.5

Mean Item Importance Ratings by Trainers
(Initial Item Pool for Training Achievement Tests)

                  Number            Mean         Interrater
MOS           Raters    Items       Rating(a)    Reliability

Batch A
  13B            6       297          4.4           .72
  64C            7       215          4.2           .78
  71L            5       122          3.8           .95
  95B            5       122          4.7           .50

Batch B
  11B            7       200          3.8           .52
  19E            6       214          4.0           .64
  31C            6       192          3.7           .69
  63B            6       238          3.3           .61
  91A/B          6       299          3.9           .81

Batch Z
  12B            6       221          4.0           .87
  16S            6       208          4.1           .61
  27E            6       219          4.0           .73
  51B            4       218          4.8           .57
  54E            6       220          3.9           .66
  55B            6       227          4.7           .32
  67N            4       215          4.5           .18
  76W            3       214          4.7           .00
  76Y            6       132          5.0           .32
  94B            6       200          3.8           .68
  19K            6       202          4.1           .75

(a) Rated on a five-point scale from "Of Little Importance" (1) to "Very
Important" (5).
Mean trainer interrater reliability across MOS was .58 (median = .62).
This compares with a mean of .67 for incumbents (median = .70) across all
three scenarios.

To  establish  the  relevance  of  the  draft  test  Items  to  training,  trainers 
were  asked  the  following: 

Can  trainees  be  expected  to  have  the  knowledge  represented 
In  the  Items  as  a  result  of  training? 

As  with  job  relevance,  the  procedure  favored  Inclusion.  If  any  one  of 
the  trainers  responded  affirmatively,  then  the  item  was  flagged  as  training¬ 
relevant.  At  this  point,  relevance  data  were  available  for  all  Items  with 
respect  to  the  Job  alone  (from  SME  Incumbents),  training  alone  (from  SME 
trainers),  or  both.  Where  the  two  judgments  overlapped,  Items  were  con¬ 
sidered  relevant  to  both  job  and  training.  Items  added  In  subsequent 
revisions  afteh  these  judgments  were  made  were  not  rated  for  relevance. 
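
The inclusion rule and the resulting relevance categories can be expressed compactly. The following Python sketch is illustrative only: the function names and the representation of each rater's judgment as a boolean are assumptions, while the rule itself (one affirmative response is enough to flag an item) and the four category labels come from the text and Table III.6.

```python
def flag_relevant(responses):
    """Inclusion rule: an item is flagged if any one rater says yes."""
    return any(responses)


def classify_item(incumbent_responses, trainer_responses):
    """Combine job and training relevance flags into one category."""
    job = flag_relevant(incumbent_responses)
    training = flag_relevant(trainer_responses)
    if job and training:
        return "Job and Training"
    if job:
        return "Job Only"
    if training:
        return "Training Only"
    return "Not Relevant"


# One incumbent out of five says yes; no trainer does.
print(classify_item([False, True, False, False, False], [False] * 6))
```

Because a single affirmative response is sufficient, the rule tends to place items in a relevant category as panels grow larger, which is consistent with the small Not Relevant counts reported.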

Table III.6 is based on relevance data obtained from job incumbents and
from trainers and shows the distribution of the various classes of items for
each MOS on the pilot test administered to trainees in the schools. The Not
Rated category consists of items added to the pool after relevance ratings
had been collected. Percentages have been computed for the Job-Only,
Training-Only, and Job-and-Training categories, using the total of these
three as the divisor.

As would be expected, many more items were rated as Job-and-Training
(2,843 or 75.5%) than as either Job Only (676 or 17.9%) or Training Only (249
or 6.6%). Also, there are substantial differences in the range of items in
these three categories. Of particular interest is the comparison between Job
Only (range = 0-78) and Training Only (range = 0-140). The large range for
Training Only is accounted for solely by MOS 91A; without this one MOS the
range would be 0-19. MOS 91A is the designation for Medical Specialists, and
incumbents appear to believe that many items which trainers consider relevant
are not relevant to the job.

Given the doctrinal emphasis on relating training to the job, it is not
surprising that (with the exception of MOS 91A) few items were rated as
Training Only, despite the effort on the part of the item writers to create
such items.
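
The category percentages quoted for the pilot test version use the sum of the three relevant categories as the divisor, excluding Not Relevant and Not Rated items. A minimal Python check of that arithmetic, using the totals reported in the text (the function name is an assumption for illustration):

```python
def category_percentages(counts):
    """Percentage of each category, using the sum of all given
    categories as the divisor, rounded to one decimal place."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Totals across all MOS for the initial item pool.
pool_totals = {"Job Only": 676, "Training Only": 249, "Job and Training": 2843}

print(sum(pool_totals.values()))          # divisor: 3768 rated-relevant items
print(category_percentages(pool_totals))
```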


Administration  to  Trainees 

After review by job incumbents and trainers, test items were
administered to groups of trainees in their last week of training. A sample
of trainees was also interviewed after the test to obtain information about
the clarity and comprehensibility of the items. Specific questions included
the following:

•  Did you have any difficulty understanding the question? Were there
any words or phrases which were difficult to understand?
Table III.6

Number and Percentage of Items Rated Relevant to Job and Training
(Initial Item Pool for Training Achievement Tests)

                              Training        Job and          Not       Not
              Job Only          Only          Training      Relevant   Rated(a)
MOS           N      %        N      %        N       %         N         N

Batch A
  13B        70    41.4       5     3.0       94    55.6        6        62
  64C        78    36.8       0     0.0      134    63.2        0        16
  71L        42    34.4       4     3.3       76    62.3        0         0
  95B        64    31.5       8     3.9      131    64.5       11        20

Batch B
  11B        68    39.5      14     8.1       90    52.3       21        25
  19E        32    16.2       9     4.6      156    79.2        2         5
  31C        47    26.3      15     8.4      117    65.4        5         8
  63B        48    23.0       8     3.8      153    73.2        2         4
  91A         0     0.0     140    54.9      115    45.1        5         5

Batch Z
  12B         7     3.4       0     0.0      197    96.6        0        23
  16S        11     5.4       0     0.0      191    94.6        0         6
  27E         1     0.5      19     9.3      185    90.2        0        15
  51B         0     0.0       0     0.0      202   100.0        0        16
  54E         0     0.0       1     0.5      207    99.5        0        15
  55B         0     0.0       5     2.4      206    97.6        0        16
  67N         1     0.5       0     0.0      208    99.5        0         8
  76W        68    31.8      12     5.6      134    62.6        0         0
  76Y        78    39.2       0     0.0      121    60.8        0         1
  94B        61    31.1       9     4.6      126    64.3        1         2

Total       676    17.9     249     6.6    2,843    75.5       60       247

(a) Items added to the pool after relevance ratings had been collected.


•  Do you agree with the correct answer? Is there a better way to state
the answer?

•  (For items derived from tasks performed in training) Is it necessary
to know the answer to this question to perform the task in training?



The results of this test administration to trainees are shown in
Table III.7. All results shown are based on items relevant to training, that
is, Job-and-Training and Training-Only items. Items relevant only to the job
are not included.


Table III.7

Results From Training Achievement Tests Administered to Trainees

           Number     Number     Mean                                Mean
             of         of      Number                              Percent
MOS       Subjects    Items     Correct     SD     Range   Alpha    Correct

Batch A
  13B        50        104       54.4      10.2      44     .81      52.3
  64C        50        130       69.0      13.7      60     .87      53.1
  71L        70         71       39.3       7.4      31     .79      55.3
  95B        50        105       69.6      10.6      46     .85      66.2

Batch B
  11B        51        111       53.4      13.7      74     .91      48.1
  19E        50        169      102.0      1-.4      86     .92      60.4
  31C        49        135       78.3      .4.6      71     .90      58.0
  63B        60        162       67.1      19.8      78     .92      41.4
  91A        49        255      128.1      40.4     201     .97      50.2

Batch Z
  12B        50        214      118.1      16.6      78     .88      55.4
  16S        71        197      120.0      19.0     112     .91      60.9
  27E        43        219      131.3      21.6     102     .92      59.9
  51B        50        218      120.5      22.0     107     .93      55.2
  54E        46        220      131.6      19.8      75     .91      59.6
  55B        48        227      153.6      21.6     101     .92      67.7
  67N        47        214      122.5      19.9     108     .91      57.3
  76W        32        146       67.1      15.1      58     .89      46.0
  76Y        50        122       68.8      19.0      84     .94      56.1
  94B        45        168       76.7      18.2      74     .90      45.6


When tests were administered in the schools, the targeted number of
subjects was 50 at each school. The actual number to whom the tests were
administered ranged from 32 for MOS 76W to 71 for MOS 16S; the mean was
50.1. The mean for coefficient alpha was .90.


An index of difficulty was computed by dividing the mean number of items
correct by the number of items, that is, the percentage of items on a test
that were correct on average. This percentage ranged from 41.4 for MOS 63B
to 67.7 for MOS 55B. The mean percentage correct was 54.5.
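
The difficulty index is simple enough to verify directly against Table III.7. The Python sketch below recomputes it for three MOS using the mean number correct and number of items reported in the table; the function name and the choice of these three rows are illustrative assumptions.

```python
def difficulty_index(mean_number_correct, number_of_items):
    """Mean percentage of items answered correctly on a test."""
    return 100 * mean_number_correct / number_of_items


# (mean number correct, number of items) from Table III.7.
rows = {
    "13B": (54.4, 104),
    "63B": (67.1, 162),   # lowest percentage correct in the table
    "55B": (153.6, 227),  # highest percentage correct in the table
}

for mos, (mean_correct, n_items) in rows.items():
    print(mos, round(difficulty_index(mean_correct, n_items), 1))
```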



Preparation of Batch A and Batch B Training Achievement Tests for Field
Tests With Job Incumbents

After all the SME judgments were made and trainee tryouts completed, the
items were revised in accordance with the SME and trainee comments. For the
Batch A and Batch B MOS, the item pools were prepared for administration to
job incumbents in the criterion field tests. Data from the field test
administration were later used (along with data from the administration of
the items to trainees, relevance data, and importance data) to convert the
pools of draft items into the standardized training knowledge tests.

As the item pools were cut and items added or changed in these early
test construction steps, the descriptive characteristics of the overall
pool--that is, importance and relevance--inevitably changed as well. Items
were dropped if they were judged to be of little importance or no relevance.
The nature of the item budget was preserved by adding new items if
necessary. The characteristics of the field test versions in terms of
importance and relevance are reported in the following subsection. These
data parallel those reported for the initial item pools.


SME Importance and Relevance Ratings: Field Test Version

Table III.8 shows the number of items, mean item importance rating for
the three scenarios by job incumbents, and incumbent interrater reliability
for the field test versions of the tests. Since the field tests included
only Batches A and B, the data reported are for 9 of the 19 MOS. Most of
these tests were culled of items prior to the field test and are consequently
shorter than tests in the item pool. The basis for the culling has already
been described. In addition, prior to the field test some items were added
on which importance data had not been collected and for which no importance
ratings were available.

As would be expected, the pattern of importance ratings across scenarios
by job incumbents was little affected by the culling procedure. There was
also little difference in the mean across all scenarios between the item pool
(3.40) and field test versions (3.43).

Table III.9 shows the ratings by trainers on the average importance of
items retained for the field test. The table also contains trainer
interrater reliabilities for Batches A and B.

As expected for the culled tests, mean importance ratings were very
slightly higher for field test versions than for the item pools for both
incumbents (3.43 vs. 3.40) and trainers (4.02 vs. 3.97). As already
discussed in connection with the item pool, trainers rated item importance
higher overall than did incumbents.
Table III.9

Mean Item Importance Ratings by School Trainers
(Field Test Version of Training Achievement Tests)

                  Number            Mean Item
                                    Importance    Interrater
MOS           Raters    Items       Rating(a)     Reliability

Batch A
  13B            6       152          4.40           .73
  64C            7       145          4.24           .78
  71L            5       122          3.78           .95
  95B            5        90          4.72           .50

Batch B
  11B            7       144          3.93           .41
  19E            6       202          4.06           .62
  31C            6       187          3.70           .62
  63B            6       216          3.44           .48
  91A            6       260          3.95           .78

(a) Rated on a five-point scale from "Of Little Importance" (1) to "Very
Important" (5).


Table III.10 contains the relevance data for the version of the test
administered to incumbents in the field tests. The distribution across
relevance categories is similar to that noted in connection with the pilot
test version in Table III.6.


Field  Test  Instruments 


At this stage the nine training achievement tests for the MOS in Batch A
and Batch B were deemed ready for field testing with job incumbents. The
field test procedure is described in Section 8, and the field test results
and the subsequent modification of the tests are described in Section 9.

Up to this point the 10 tests for the 10 MOS in Batch Z followed the
same developmental steps as for the tests in Batches A and B. However, as
noted previously, the Batch Z instruments were not field tested with job
incumbents. Consequently, the Concurrent Validation versions of these 10
tests retain more items than do the 9 A/B tests. Additional item analyses
will be carried out for Batch Z on the basis of the data from the Concurrent
Validation sample.

Copies of the 19 MOS tests as used in Concurrent Validation are
contained in the ARI Research Note under preparation.
Table III.10

Number and Percentage of Items Rated Relevant to Job and
Training (Field Test Version of Training Achievement Tests)

                              Training        Job and          Not       Not
              Job Only          Only          Training      Relevant    Rated
MOS           N      %        N      %        N       %         N         N

Batch A
  13B        70    41.2       5     2.9       95    55.9        6        59
  64C        80    37.2       0     0.0      135    62.8        0        13
  71L        42    34.4       4     3.3       76    62.3        0         8
  95B        64    31.5       8     3.9      131    64.5       11        20

Batch B
  11B        68    39.5      14     8.1       90    52.3       21        26
  19E        32    16.2       9     4.6      156    79.2        2         5
  31C        47    26.3      15     8.4      117    65.4        5        20
  63B        48    23.0       8     3.8      153    73.2        2         8
  91A         0     0.0     140    54.9      115    45.1        5         5

Total       451    26.2     203    11.8    1,068    62.0       52       164



Section  3 

DEVELOPMENT  OF  TASK-BASED  MOS-SPECIFIC  CRITERION  MEASURES1 


The MOS-specific criterion measures described in this section concern
the assessment of performance on a sample of job tasks that were identified
as representative of all job tasks in the MOS. The general procedure was to
develop a careful description of all the major tasks that comprise the job,
draw a sample of these tasks, and develop multiple measures of performance on
each task.


Objectives 

The major objective is to develop reliable and valid task-based measures
of first-tour performance in the nine Batch A and Batch B MOS. Such measures
are intended to reflect that part of the performance domain having to do with
job-specific technical competence. Hands-on (job sample) tests, paper-and-
pencil knowledge tests, and ratings measures were developed for each job
task.


While no one measure can be assumed in advance to be a better estimate
of the job incumbent's "true" performance, intercorrelations among the
measures are of interest for what they tell us about the common and unique
variance. Consequently, another objective is to compare alternative measures
in terms of their construct validity for assessing task proficiency.

As noted in Part I, nine MOS were selected for study (see Table
III.11). These nine MOS were chosen to provide maximum coverage of the total
array of knowledge, ability, and skill requirements of Army jobs.


Development  Procedure 

The design strategy for the MOS-specific measures involved, for each
MOS, selection of approximately 30 tasks that accurately sampled the job
domain. The total number of tasks was dictated primarily by time
constraints. While the time required to assess performance on individual
tasks would differ with the nature of the task, a total of 30 tasks for each
MOS seemed reasonable as a planning figure.

For each MOS, all 30 tasks would be assessed with written knowledge
tests. Fifteen of the 30 tasks would also be assessed with hands-on tests.
Finally, task performance ratings would be obtained for the 15 tasks measured
with the hands-on job sample tests, and job history data covering recency and
frequency of performance would be researched for all 30 tasks.


1 This section is based primarily on ARI Technical Report 717, Development
and Field Test of Task-Based MOS-Specific Criterion Measures, by Charlotte
H. Campbell, Roy C. Campbell, Michael G. Rumsey, and Dorothy C. Edwards, and
the supplementary ARI Research Note in preparation, which contains the report
appendixes.


Table III.11

Military Occupational Specialties (MOS)
Selected for Criterion Test Development

Batch A

13B  Cannon Crewman
64C  Motor Transport Operator
71L  Administrative Specialist
95B  Military Police

Batch B

11B  Infantryman
19E  Armor Crewman
31C  Single Channel Radio Operator
63B  Light Wheel Vehicle Mechanic
91A  Medical Specialist


The MOS-specific task tests and the auxiliary instruments were developed
and field tested for the Batch A MOS before we began developing the tests in
the remaining five MOS (Batch B). While the procedures were the same for
Batch A and Batch B, some lessons learned from Batch A development were
applied to Batch B. Across all nine MOS some individual variation was
necessary because of particular circumstances in an MOS; however, variations
were slight. The general procedure was composed of eight major activities:

•  Define task domain.

•  Collect SME judgments.

•  Analyze SME judgments.

•  Select tasks to be tested.

•  Assign tasks to test mode.

•  Construct hands-on and knowledge tests.

•  Conduct pilot tests and make revisions.

•  Construct auxiliary instruments.

These eight major activities are discussed in the following subsections.


Definition  of  Task  Domain 


Defining task domain involved dealing with the entire population of
tasks for an MOS. The job task descriptions of first-tour (Skill Level 1)
soldiers were derived from three primary sources:

1.  MOS-Specific Soldier's Manuals (SM). Each MOS Proponent, the agency
responsible for prescribing MOS policy and doctrine, prepares and
publishes a Soldier's Manual that lists and describes tasks, by
Skill Level, that soldiers in the MOS are doctrinally responsible
for knowing and performing. The number of tasks varies widely among
the nine MOS, from a low of 17 Skill Level 1 (SL1) tasks to more
than 130 SL1 tasks.

2.  Soldier's Manual of Common Tasks (SMCT) (FM 21-2, 3 October 1983).2
The SMCT describes tasks that each soldier in the Army, regardless
of his or her MOS, must be able to perform. The 1983 version
contains 78 SL1 tasks and "supersedes any common tasks appearing in
MOS-specific Soldier's Manuals" (p. vii).

3.  Army Occupational Survey Program (AOSP). The AOSP obtains task
descriptions by surveying job incumbents with a questionnaire
checklist that includes several hundred items. The items are
obtained from a variety of sources (e.g., the Proponent school), and
include and expand the doctrinal tasks from the preceding two
sources. The AOSP is administered periodically to soldiers in all
skill levels of each MOS by the U.S. Army Soldier Support Center.
The analysis of responses by means of the Comprehensive Occupational
Data Analysis Program (CODAP) provides the number and percentage of
soldiers at each skill level who report that they perform each
task. The number of tasks or activities in the surveys for the nine
MOS of interest ranged from 487 to well over 800.

While the above sources provided the main input to the MOS job
descriptions, Proponent agencies were also contacted directly to determine
whether other relevant tasks existed. The number of additional tasks thus
generated was not large, but the added tasks were sometimes significant. For
example, the pending introduction of new equipment added tasks that had not
yet appeared in the written documentation.

Completion  of  the  above  process  resulted  in  a  not  very  orderly 
accumulation  of  tasks,  part  tasks,  steps,  and  activities.  To  bring  some 
order  to  this  collection,  a  six-step  refinement  process  was  conducted  for 
each  MOS. 


1. Identify AOSP activities performed at SL1. The assumption for this
step was that every activity included in an AOSP survey that had a
non-zero response frequency among SL1 soldiers, after allowing for
error in the survey, was performed at SL1. The procedure for
estimating the error was to compute the boundaries of a confidence
interval about zero. Tasks or activities with frequencies above the
confidence interval boundary were considered to have non-zero
frequencies and were retained. The percentage of tasks/activities
dropped from each AOSP by this procedure was about 25% for each
MOS.


²For Batch A MOS, the version of Field Manual 21-2 in effect during task
selection was the 1 December 1982 edition, containing 71 tasks.

2. Group AOSP statements under SM tasks. An AOSP questionnaire item
was referenced to an SM task if the item duplicated the SM task or
was subsumed under the SM task as a step or variation in conditions,
even if doctrine did not specifically identify the activity
as an SL1 responsibility.

3. Group AOSP-only statements into tasks. Since some AOSP questionnaire
items could not be matched with any SM task, the next step was
to edit such AOSP items so that they were similar in format to the
SM task statements but were still a clear portrayal of additional
task content not contained in the SM.

4. Consolidate domain (Proponent review). The first three steps resulted
in a fairly orderly array of tasks for each MOS. With this
task list it was feasible to go to the Proponent agency for each MOS
for verification. At each Proponent a minimum of three senior NCOs
or officers reviewed the list and eliminated tasks that had been
erroneously included in the domain. While specific reasons for
dropping tasks varied with each MOS, the general categories were:

• Tasks specific to equipment that was being changed.

• Tasks eliminated by current doctrine not yet reflected in available
publications.

• Collective tasks actually performed by crews/squads, platoons, or
even companies/batteries.

• Tasks specific to equipment variations that should be combined.

• Tasks specific to the mission of the Reserve Component. While
for most units there are no discriminable task differences between
active duty and Reserve Component organizations, this is
not true for all MOS.

The full consolidated domain list of tasks, with supporting AOSP
statements for each MOS, is contained in Appendix B, ARI Research
Note in preparation.

5. Delete tasks that pertain only to restricted duty positions. The SM
for most MOS contains tasks for individual duty positions within the
MOS. For example, the 64C Motor Transport Operator can be a Dispatcher;
the 95B Military Policeman can be a Security Guard. For
most duty positions, incumbents move freely in and out of the position,
the performance of the duty position tasks being dependent on
whether they are assigned the position or not. Other positions are
more permanent. Restricted Duty Positions were operationally defined
as those for which the award of an Additional Skill Identifier (ASI)
or Special Skill Identifier (SSI) and at least 1 week of specialized
training were required. Five duty positions in two MOS were
affected.


6. Delete Higher Skill Level (HSL) and AOSP-only tasks with atypically
low frequencies. The first step in this process was to translate
AOSP frequencies into task frequencies. Generally, when AOSP and
task statements matched, the AOSP frequency for the matching statement
was applied to the task. If there was no match, the most
frequent step or condition was the basis for the task frequency.
However, in some cases, frequencies were aggregated to account for
equipment differences.

The general approach for identifying low-frequency tasks was to
compare frequency distributions of the SL1 tasks with the HSL and
AOSP-only tasks. A four-step procedure identified the atypically
infrequent tasks to be eliminated:

• List the response frequencies of SL1 tasks from the AOSP/CODAP.

• List the response frequencies of HSL/AOSP-only tasks.

• Test the groups (lists) for difference, using the Mann-Whitney U test.

• If the groups were different, and the HSL/AOSP-only group had
tasks with lower response frequencies (which they did in all
cases), eliminate those low-frequency tasks until group differences
were not significant at the .01 level.
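The four-step elimination procedure can be illustrated with a minimal sketch. The report does not give the computational details, so several points here are assumptions: the test is taken as two-sided, the p-value comes from the normal approximation with a tie correction and no continuity correction, and tasks are removed one at a time starting with the lowest frequency. The function names and the sample frequencies are hypothetical.

```python
import math

def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j are 0-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def mann_whitney_p(x, y):
    """Two-sided p-value for the Mann-Whitney U test (normal approximation)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    combined = list(x) + list(y)
    ranks = average_ranks(combined)
    u1 = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    # Tie correction: sum of (t^3 - t) over groups of tied values.
    counts = {}
    for v in combined:
        counts[v] = counts.get(v, 0) + 1
    tie_term = sum(t ** 3 - t for t in counts.values())
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if variance == 0:
        return 1.0
    z = (u1 - n1 * n2 / 2) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))

def trim_low_frequency(sl1, other, alpha=0.01):
    """Drop the lowest-frequency HSL/AOSP-only tasks until the two
    frequency lists no longer differ at the given significance level."""
    kept = sorted(other)
    dropped = []
    while len(kept) > 1 and mann_whitney_p(sl1, kept) < alpha:
        dropped.append(kept.pop(0))
    return kept, dropped

# Hypothetical percent-performing figures, for illustration only.
sl1_freqs = [35, 40, 42, 48, 55, 60, 62, 70, 75, 81]
hsl_freqs = [1, 2, 2, 3, 4, 5, 38, 44, 52, 66, 73]
kept, dropped = trim_low_frequency(sl1_freqs, hsl_freqs)
```

Removing tasks from the bottom of the HSL/AOSP-only list is only one way to read "eliminate those low-frequency tasks"; the stopping rule (first non-significant result at .01) follows the text.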

The result of this process was a final task list for each MOS. It
included all SL1 MOS and common tasks with non-zero frequencies (or no AOSP/
CODAP frequency) and HSL/AOSP-only tasks performed by SL1 soldiers. Table
III.12 shows the reduction of the task list during each phase and the reasons
for the reduction by MOS. The nine final task lists are contained (with data
from the SME judgments, detailed below) in Appendix C, ARI Research Note in
preparation.


Collection  of  SME  Judgments 


After the MOS domains were refined, every domain comprised more than 100
tasks. To select 30 representative tasks for each MOS, more information was
needed. MOS Proponent agencies were asked to provide subject matter experts
(SMEs) to make judgments regarding the tasks on the task list. Requirements
were that they be in the grade E-6 or above (i.e., second or third tour) or
officers in the grade O-3 (captain) or above. Recent field experience
supervising SL1 personnel was an additional requirement. For Batch A MOS,
15 SMEs in each MOS were requested. For Batch B, some modifications were made
in the review process (detailed below) and 30 SMEs in each of these MOS were
requested. Collection of SME data required approximately 1 day for each MOS.
The number of SMEs obtained for each MOS and samples of all instructions
provided to SMEs are contained in Appendix D, ARI Research Note in
preparation.

Three  types  of  judgments  were  obtained  from  the  SMEs: 

Task Clustering. Each task was listed on a 3" x 5" card along with a
brief description. SMEs were told to sort the tasks into groups so that all
the tasks in each group were alike, and each group was different from the
other groups. For the Batch B MOS, common tasks were grouped for the SMEs,
based on the clustering derived from the Batch A data. SMEs were permitted to
add to or break up the groups as they saw fit.

[Table: Effects of Domain Definition on MOS Task Lists]

Task Importance. To set the context for the Batch A MOS task importance
judgments, all SMEs were given a European scenario that specified a high
state of training and strategic readiness but was short of involving actual
conflict. After collection of Batch A data, concern was expressed as to the
scenario effect on SME judgments. As a result, for Batch B MOS three
scenarios were used. An "Increasing Tension" scenario identical to that used
in Batch A was retained, and a "Training" scenario specifying a stateside
environment and a "Combat" scenario (European non-nuclear) were developed.
Sample MOS definitions for the three scenarios are given in Figure III.3.
These scenarios are similar to those used in the training achievement test
development (see Section 2).

The SMEs for each Batch B MOS were randomly divided into three groups
and each group was given a different scenario as a basis for judgments.
However, for MOS 63B (Light Wheel Vehicle Mechanic) only 11 SMEs were
available and a repeated measures procedure was used; that is, each 63B SME
rated task importance three times, using each of the three scenarios in
counterbalanced order. To make their judgments, SMEs were asked to rate the
importance of the task in performing the MOS job in support of the unit
mission under the appropriate scenario.

Slightly different procedures were used in Batch A and Batch B. For
Batch A MOS, the judges were given the tasks on individual cards, identical
to those used in task clustering, and told to rank the tasks from Most
Important to Least Important. For Batch B MOS, judges were provided a
list of the tasks, with descriptions, and asked to rate them on a 7-point
scale from "1 = Not at all important for unit success" to "7 = Absolutely
essential for unit success."

Task Performance Difficulty. To arrive at an indication of expected
task difficulty, SMEs were asked to sort a "typical" group of 10 soldiers
across five performance levels based on how they would expect a typical group
of SL1 soldiers to perform on each task.
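The card sorts collected under Task Clustering can be summarized as a task-by-task similarity matrix. The sketch below is one plausible reading of a cross-product matrix built from sortings: each cell counts the SMEs who placed two tasks in the same group. The data layout, the function name, and the sample sorts are assumptions, and the subsequent factor analysis is not shown.

```python
def cooccurrence_matrix(sorts, n_tasks):
    """Count, for each pair of tasks, how many SMEs put both in one group.

    `sorts` holds one entry per SME; each entry is a list of groups, and
    each group is a list of task indices (0-based).
    """
    matrix = [[0] * n_tasks for _ in range(n_tasks)]
    for groups in sorts:
        for group in groups:
            for a in group:
                for b in group:
                    matrix[a][b] += 1
    return matrix

# Three hypothetical SMEs sorting four tasks.
sme_sorts = [
    [[0, 1], [2, 3]],
    [[0, 1, 2], [3]],
    [[0, 1], [2], [3]],
]
similarity = cooccurrence_matrix(sme_sorts, n_tasks=4)
```

The diagonal of the matrix equals the number of SMEs, and the matrix is symmetric, which is the form a factor analysis routine would typically expect as input.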

Analysis of SME Judgments. The judgment data were analyzed and the
following products were obtained:

Cluster Membership. Task clusters were identified by means of a
factor analysis of a cross-product matrix derived from the SMEs'
task similarity clusterings.

Importance. The importance rank of each task, averaged across
SMEs, was analyzed by computer. For Batch A MOS, a single
importance score was obtained. For Batch B MOS, a rank ordering of
average importance ratings was generated under each of the three
scenarios.


Your personnel unit is deployed to Europe as part of a U.S. Corps during
a deteriorating political and military situation. The Corps mission is
to defend and maintain the host country's border in the event that
hostilities escalate. The enemy approximates a combined arms army and
has nuclear and chemical capability. Air parity does exist. The Corps
has drawn all equipment and is fully operational. In support of the
Corps Personnel Operations Center, your unit is responsible for
supporting the functions of strength accounting, replacement operations,
casualty reporting, personnel management, personnel actions, and
personnel records.

--Neutral  or  "Increasing  Tension"  Scenario  for  MOS  71L 


Your soldiers are assigned to support the activities of the Installation
Provost Marshal on a large Army post in the midwestern United States.
Post activities include a basic training center, an officer/enlisted
training school, and maneuver units under a separate brigade
organization. There is a permanent on-post dependent population of
approximately 4,000. Provost activities supported include physical
security and crime prevention, investigations, traffic and game warden
operations, K-9 section, AWOL apprehension/civil liaison, vehicle and
weapons registration, and operation of the installation detention
facility. 95B personnel also must complete individual soldier training,
SQT testing, command and maintenance inspections, and team/unit training.

--Training (CONUS) Scenario for MOS 95B


Your tank battalion is assigned to a U.S. Corps in Europe. Hostilities
have broken out and the Corps combat units are engaged. The Corps
mission is to defend, then reestablish, the host country's border.
Pockets of enemy airborne/heliborne and guerrilla elements are operating
throughout the Corps sector area. The Corps maneuver terrain is rugged,
hilly, and wooded, and weather is expected to be wet and cold. Limited
initial and reactive chemical strikes have been employed but nuclear
strikes have not been initiated. Air parity does exist.

--Combat  Scenario  for  MOS  19E 


Figure III.3. Scenarios used in SME ratings of task importance for
task-based MOS-specific tests.


Difficulty. To calculate judged task difficulty, the mean of the
distribution of 10 hypothetical soldiers across the five performance
levels of each task, averaged across SMEs, was computed.

Performance Variability. The standard deviation of the distribution
of 10 hypothetical soldiers across the five performance levels
on each task was averaged across SMEs. This statistic is intended
to be an indicator of the variability in performance that would be
expected of a task.
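The Difficulty and Performance Variability statistics can be sketched as follows. The numeric coding of the five performance levels (1 through 5), the use of the population rather than the sample standard deviation, and the sample judgments are assumptions; the report states only that the mean and standard deviation of each SME's distribution of 10 soldiers were averaged across SMEs.

```python
import math

def mean_and_sd(counts, levels=(1, 2, 3, 4, 5)):
    """Mean and population SD of performance level for one SME's sort.

    `counts[i]` is the number of the 10 hypothetical soldiers the SME
    placed at performance level `levels[i]`.
    """
    n = sum(counts)
    mean = sum(level * c for level, c in zip(levels, counts)) / n
    variance = sum(c * (level - mean) ** 2 for level, c in zip(levels, counts)) / n
    return mean, math.sqrt(variance)

def task_statistics(sme_counts):
    """Average the per-SME means and SDs to get the two task statistics."""
    stats = [mean_and_sd(counts) for counts in sme_counts]
    difficulty = sum(m for m, _ in stats) / len(stats)
    variability = sum(s for _, s in stats) / len(stats)
    return difficulty, variability

# Two hypothetical SMEs distributing 10 soldiers over five levels.
judgments = [
    [1, 2, 4, 2, 1],
    [0, 2, 6, 2, 0],
]
difficulty, variability = task_statistics(judgments)
```

Both SMEs in this example give the same mean level, but the first spreads the soldiers more widely, so the averaged standard deviation separates tasks that the mean alone would not.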

Selection  of  Tasks  To  Be  Tested 

While  the  methods  used  for  selecting  tasks  were  similar  for  Batch  A  and 
Batch  B  MOS,  there  were  enough  differences  to  warrant  their  being  outlined 
separately. 

Batch A Test Task Selection. From five to nine project staff, including
the individual who had prime responsibility for that particular MOS,
participated in the selection process for each MOS. The task selection panel
was provided the data summaries of the SME judgments and asked to make an
initial selection of 35 tasks to represent each MOS. No strict rules were
imposed on the analysts in making their selections, although they were told
that high importance, high performance variability, a range of difficulty,
and frequently performed tasks were desirable, and that each cluster should
be sampled.

To capture the policy that each staff person used, task selections were
first regressed on the task characteristics data to identify individual
selection policies. The equations were then applied to the task
characteristics data to provide a prediction of the task selections each
individual would have made if his or her selections were completely
consistent with a linear model.
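The policy-capturing step can be illustrated with a minimal sketch. It assumes ordinary least squares with an intercept, a 0/1 coding of each analyst's selections, and a predicted selection formed by taking the highest-scoring tasks; none of these details is given in the report. The function names, the two task characteristics, and the data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def capture_policy(features, selected):
    """Least-squares weights (intercept first) of 0/1 selections on features."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, selected)) for i in range(k)]
    return solve_linear_system(xtx, xty)

def predicted_selection(features, weights, n_select):
    """Indices of the n_select tasks with the highest predicted scores."""
    scores = [weights[0] + sum(w * v for w, v in zip(weights[1:], f))
              for f in features]
    ranked = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:n_select])

# Hypothetical (importance, variability) values and one analyst's choices.
task_features = [(9, 0.4), (8, 1.1), (7, 0.9), (6, 1.0), (3, 0.5), (2, 0.8)]
analyst_choice = [1, 1, 0, 1, 0, 0]
weights = capture_policy(task_features, analyst_choice)
predicted = predicted_selection(task_features, weights, n_select=3)
observed = [i for i, y in enumerate(analyst_choice) if y]
discrepancies = sorted(set(observed) ^ set(predicted))
```

Tasks listed in `discrepancies` are those an analyst would be asked to review: chosen but not predicted by the linear policy, or predicted but not chosen.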

In the second phase of selection, analysts were given their original
task selections and the selections predicted by their regression-captured
policies. They were directed to review and justify discrepancies between
their observed and predicted selections. Analysts independently either
modified their selections or justified their original selections. The
rationale for intentional discrepancies was identified and the regression
equations adjusted.

The next phase was a Delphi-type negotiation among analysts to converge
their respective choices into a list of 35 tasks for each MOS. Information
on the choices and rationale provided by each analyst in the preceding phase
was distributed to all analysts, and each made a decision to retain or adjust
his or her decisions, taking into account opinions others had expressed.
Decisions and revisions were collected, collated, and redistributed as needed
until near consensus was reached. For all MOS, three iterations were
necessary.
The resulting task selection lists were mailed to each Proponent; a
briefing by Project A staff was provided if requested. A Proponent
representative then coordinated a review of the list by Proponent personnel
designated as having the appropriate qualifications. After some minor
Proponent-recommended adjustments, the final list of 30 tasks was selected.
These are listed in Appendix F, ARI Research Note in preparation.

Batch B Test Task Selection. Based on experiences with Batch A
selection, some modifications were introduced in the selection process for
Batch B. One primary concern was to involve Proponent representatives more
actively in the selection process. Also, the Batch A experience showed
analysts' selections to be non-linear. Analysts qualified their selections
on the basis of knowledge of the MOS or the tasks, information not directly
represented in the data; they used non-linear combinations that often
differed within each cluster. Therefore, the decision was made to drop the
regression analysis for Batch B selection.

The panel for Batch B selection consisted of five to nine members of the
project staff, as in Batch A, combined with six military personnel (NCOs and
officers) from each MOS. These six were in the grade of E-6 or higher with
recent field experience, and were selected to provide minority and gender
(for applicable MOS) representation in the task selection process.

The materials provided the selection panel were the same variables
generated by the SME judgments. Again, no strict rules were imposed.
However, panel members were provided a target number of tasks to be selected
from each cluster (calculated in proportion to the total number of tasks in
each cluster). A second adjustment prescribed a minimum of two tasks per
cluster to permit estimation of the correlation among tasks in the cluster.
Within these constraints the Delphi procedure was again used to reach
consensus. The tasks selected for Batch B MOS are listed in Appendix F,
ARI Research Note in preparation.
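The two adjustments that set the cluster targets can be sketched as a proportional allocation followed by a floor of two tasks per cluster. The report does not say how fractional targets were rounded or how the total was reconciled after the minimum was applied, so the largest-remainder rounding used here is an assumption, as are the function name and the cluster sizes.

```python
def cluster_targets(cluster_sizes, total=30, minimum=2):
    """Target number of test tasks per cluster.

    Targets are proportional to cluster size (largest-remainder rounding),
    then raised to `minimum` where needed. The result can therefore sum to
    slightly more than `total`.
    """
    n_all = sum(cluster_sizes.values())
    quotas = {c: total * size / n_all for c, size in cluster_sizes.items()}
    targets = {c: int(q) for c, q in quotas.items()}
    leftover = total - sum(targets.values())
    by_remainder = sorted(quotas, key=lambda c: quotas[c] - targets[c],
                          reverse=True)
    for c in by_remainder[:leftover]:
        targets[c] += 1
    for c in targets:
        targets[c] = max(targets[c], minimum)
    return targets

# Hypothetical clusters for one MOS domain of 100 tasks.
sizes = {"weapons": 52, "communications": 30, "first aid": 14, "navigation": 4}
targets = cluster_targets(sizes)
```

In this example the smallest cluster would receive only one task under strict proportionality and is raised to two, which is what permits a within-cluster correlation to be estimated.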

Assignment  of  Tasks  to  Test  Mode 

The initial development plan required that, for each MOS, knowledge
tests be developed for all 30 tasks, and hands-on tests for 15 of these tasks
(since such testing for all 30 tasks would exceed the hands-on
resources). The considerations that constrained selection for hands-on
testing were:

• Fifteen soldiers must complete all 15 hands-on tests in 4 hours.
No single test is to take more than 20 minutes.

• Scorer support would be limited to eight NCO scorers.

• The hands-on test site must be within walking distance of the other
test activities.

• Equipment requirements must be kept within reason if units are to
support the requirements.

• The test must be administrable in a number of installations. Tasks
must not be affected by local operating procedures.


On the basis of these constraints, in each MOS a project staff member
prepared an anticipated hands-on test approach for each of the 30 tasks in
the MOS. Working independently, each of the five project analysts first
reviewed the suggested test approach, and modified it as he/she deemed
necessary. The analyst then assigned points to each task to indicate
hands-on test suitability, using the following three areas of consideration:


• Skill Requirements - Analysts determined a numerical value for skill
requirements based on the number of steps requiring physical
strength, control, or coordination. Each skill step was counted as
one point.

• Omission - This rating considered the likelihood that a soldier would
omit a required step. For a step to have "omission value":

- A soldier must be able to complete the procedure (albeit
incorrectly) without performing the step.

- Nothing in the test situation must cue the soldier to do the
step.

Each "omission step" received a numerical rating of one.

• Time Value - When "doctrine" (usually the SM) specified a time limit
for task performance, the task was awarded a numerical value of two.
Where no doctrinal time limit had been established but where time
would be a reliable indication of task proficiency, the task was
awarded a numerical value of one.


Following the individual ratings, analysts met in group discussions and
proceeded task by task to resolve differences until a consensus was reached
and a single numerical score was assigned to each task. The tasks were then
rank ordered, and a final feasibility check was conducted to ensure that the
top 15 rated tasks fell within the 4-hour time limit. As an example, the
tasks selected and assigned for MOS 11B are shown in Figure III.4.
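The point system and the feasibility check can be sketched as follows. The report says that a single score was assigned to each task by consensus; treating that score as the simple sum of the three point values is an assumption, as are the estimated testing times, the class and function names, and the sample tasks. The 15-test, 4-hour, and 20-minute limits come from the constraints listed earlier.

```python
from dataclasses import dataclass

@dataclass
class TaskRating:
    name: str
    skill_steps: int                  # steps needing strength, control, or coordination
    omission_steps: int               # uncued steps a soldier could skip and still finish
    doctrinal_time_limit: bool        # doctrine (usually the SM) sets a time limit
    time_indicates_proficiency: bool  # no doctrinal limit, but time is informative
    minutes: float                    # estimated testing time (hypothetical field)

    def suitability(self) -> int:
        """Sum of skill points, omission points, and time value (2, 1, or 0)."""
        if self.doctrinal_time_limit:
            time_value = 2
        elif self.time_indicates_proficiency:
            time_value = 1
        else:
            time_value = 0
        return self.skill_steps + self.omission_steps + time_value

def select_hands_on(ratings, n_tests=15, max_minutes=240, max_single=20):
    """Rank tasks by suitability and check the time constraints."""
    ranked = sorted(ratings, key=lambda r: r.suitability(), reverse=True)
    chosen = ranked[:n_tests]
    feasible = (sum(r.minutes for r in chosen) <= max_minutes
                and all(r.minutes <= max_single for r in chosen))
    return chosen, feasible

# Hypothetical ratings for three tasks; a real MOS would have about 30.
example = [
    TaskRating("Load and clear machinegun", 6, 3, True, False, 15),
    TaskRating("Estimate range", 1, 0, False, False, 10),
    TaskRating("Prepare range card", 3, 2, False, True, 20),
]
chosen, feasible = select_hands_on(example, n_tests=2, max_minutes=40)
```

If the check fails, the source implies that the list was adjusted until the top-rated tasks fit the time limit; how that adjustment was made is not described, so it is not modeled here.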


Construction  of  Hands-On  and  Knowledge  Tests 


For both hands-on and knowledge tests, the primary source of information
was task analysis data. Task analyses were derived from the Soldier's
Manuals, technical manuals, and other supporting Army publications, as well
as SME input and direct task observation where necessary. Much of the
development effort involved having specific staff members work on both types
of tests for the same tasks.
HANDS-ON AND KNOWLEDGE TESTS
Put on Field or Pressure Dressing
Perform Operator Maintenance on M16A1
Load, Reduce Stoppage, Clear M60 Machine Gun
Set Headspace/Timing on .50 Cal Machine Gun
Engage Targets With Hand Grenades
Prepare Dragon for Firing
Prepare Range Card for M60 Machine Gun
Engage Targets With LAW
Put on/Wear M17 Gas Mask
Operate Radio AN/PRC-77
Operate as Station in Radio Net
Install/Fire/Recover Claymore
Techniques of MOUT
Zero AN/PVS-4 on M16A1
Conduct Surveillance w/o Electronic Devices


KNOWLEDGE TESTS ONLY
Perform CPR
Administer Nerve Agent Antidote
Call for/Adjust Indirect Fire
Navigate on Ground
Put on Protective Clothing
Collect/Report Information - SALUTE
Camouflage Self and Equipment
Recognize Armored Vehicles
Move Under Direct Fire
Estimate Range
Perform PMCS-M113 or 1/4 Ton
Drive Wheeled or Track Vehicle
Hasty Firing Position, Urban Terrain
Establish Observation Post
Select Overwatch Position
Place AN/PVS-5 Into Operation


Figure III.4. Infantryman (MOS 11B) tasks selected for
Hands-On/Knowledge Testing.


Hands-On Test Development. The model for hands-on test development
emphasized four activities:

• Determine test conditions. Test conditions are designed to maximize
the standardization of the test between test sites and among soldiers
at the same test site. Test conditions are determined for the test
environment, equipment, location, and task limits.

• List performance measures. The performance measures are the substantial
elements of the task to be tested and the behaviors to be
rated GO/NO-GO by the scorer. Performance measures are defined as
either product or process depending on what the scorer is directed to
observe to score the behavior. Performance measures must adhere to
the following principles:

- Describe observable behavior only.

- State a single pass/fail behavior.

- Contain only necessary actions.

- Contain a standard (how much or how well).

- State an error tolerance limit if variation in behavior is
permissible.

- Include a scored time limit if, and only if, the task or step is
doctrinally time-constrained; that is, the Soldier's Manual
specifies a time limit for performing the task.

- Include a sequence requirement if, and only if, sequence is
doctrinally required.

• State examinee instructions. The instructions must be kept very
short and very simple; any information not absolutely essential to
performance must be excluded. Examinee instructions are read
verbatim to the soldier by the scorer and may be repeated at any
time. These written instructions are the only verbal communications
the scorer is allowed to have with the soldier during the test.

• Develop scorer instructions. These instructions tell the scorer how
to set up, administer, and score the test. They cover both usual and
unusual situations, and ensure standardized administration and
scoring.

Examples of one hands-on task from the MOS 71L and MOS 11B protocols are
shown in Figures III.5 and III.6.

Knowledge Test Development. The format of the knowledge tests was
dictated by their proposed use. For example, free-response formats demand
more of the soldier's literacy skills and are more difficult to score
reliably than are multiple-choice formats, which are easier to score and are
familiar to most soldiers. However, multiple-choice formats are difficult to
develop because of inherent cueing, particularly between items, and the need
to develop alternatives that are likely and plausible but clearly wrong.
Because of the large quantity of data to be gathered in the project, machine
scoring is essential. Therefore, a multiple-choice, single-correct-response
format was selected.

Test  administration  constraints  dictate  the  number  of  tasks  to  be  tested 
and  the  time  available  for  testing.  For  Project  A,  all  tasks  selected 
(approximately  30  per  MOS)  would  be  tested  in  the  knowledge  mode.  Four  hours 
were  allocated  to  the  knowledge  testing  block  for  the  field  trials,  to  be 
reduced  to  2  hours  for  Concurrent  Validation  testing.  Allowing  an  average  of 
slightly  less  than  1  minute  to  read  and  answer  one  item  dictated  an  average 
of  about  nine  items  per  task. 
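The arithmetic behind the figure of about nine items per task can be checked with a short calculation. The value of 0.9 minute per item is an assumed stand-in for "slightly less than 1 minute"; the 4-hour block and the 30 tasks come from the text.

```python
def items_per_task(block_minutes, n_tasks, minutes_per_item):
    """Average number of items each task can receive in the testing block."""
    minutes_per_task = block_minutes / n_tasks
    return minutes_per_task / minutes_per_item

# Field trials: 4 hours, about 30 tasks, roughly 0.9 minute per item (assumed).
field_trial_items = items_per_task(block_minutes=4 * 60, n_tasks=30,
                                   minutes_per_item=0.9)
```

At exactly 1 minute per item the same block would allow 8 items per task, so the stated average of about nine depends on the slightly faster assumed pace.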

Knowledge test development was based on the same information that was
available for hands-on development. However, three distinct characteristics
of multiple-choice performance knowledge test items are that they:

• Are performance-based. Most tasks, of course, cannot elicit full
job-like behavior in the knowledge mode and therefore must be tested
using performance-based items. These items require the examinee to
select an answer describing how something should be done.


SCORESHEET
TYPE A MEMORANDUM

Scorer:                    Soldier:                   Date:
SSN:                       Soldier ID:
Time    Start:             Finish:

NOTE TO SCORER:

Tell the soldier: "ASSUME YOU HAVE JUST RECEIVED THIS
DRAFT TO BE TYPED AS A MEMORANDUM IN FINAL FORM. YOU
MAY REFER TO THE SUPPLEMENT BOOK AND THE DICTIONARY IF
YOU WANT. YOU CAN MAKE CORRECTIONS. WORK AS FAST AND
AS ACCURATELY AS YOU CAN. HOW MANY COPIES OF THE
ORIGINAL WOULD YOU MAKE?" NUMBER OF COPIES:

PERFORMANCE MEASURES                              GO    NO-GO

1. Correct number of copies (2)
• 1 white
• 1 yellow manifold

2. One inch left and right margins
• ± 1 space

3. Correct letterhead
• 5th line below top
• Centered
• DA all caps, other initial caps

4. Correct reference (office) symbol - AZAK-YD
• Left margin
• 4th line below last letterhead line

5. Correct date - 10 October 1984 or 10 OCT 84
• End right margin
• Same line as reference


Figure III.5. Administrative Specialist (MOS 71L) Hands-On Performance Test
sample (Page 1 of 2).
PERFORMANCE MEASURES                              GO    NO-GO

6. Correct memo addressee
• 4th line below reference
• Left margin
• All capitals
• 2nd addressee below 1st

7. Correct SUBJECT line
• Left margin
• 2nd line below last addressee
• Colon after SUBJECT, 2 spaces

8. Correct body
• 5th line below subject, left margin
• Paragraphs numbered
• Numbers and all lines in left margin
• Single space
• Double space between paragraphs

9. Correct Authority line
• 2nd line below last body line
• Left margin
• All caps
• Colon after FOR THE COMMANDER

10. Correct signature block
• 5th line below Authority line
• Begins at center
• Name (all caps), rank (initial or all caps), branch (all caps)
• Title initial caps

11. Corrections neat and clean

Number of typographical errors:
• Sample typos: strikeovers, misspelling, incorrect punctuation,
incorrect spacing.


Figure III.5. Administrative Specialist (MOS 71L) Hands-On Performance Test
sample (Page 2 of 2).
SCORESHEET

LOAD, REDUCE A STOPPAGE AND CLEAR AN
M60 MACHINEGUN

Scorer:                    Soldier:                   Date:
ID:
Know Soldier?      Soldier in CO.?      Supervise Soldier?

INSTRUCTIONS TO SOLDIER: For this test you must load, fire, apply immediate
action and clear the M60 machinegun. Do not go on to the next procedure until
I tell you to. First, you must load and fire the machinegun. Begin.

PERFORMANCE MEASURES:                         GO     NO-GO   COMMENTS

Load

1. Placed safety in FIRE.                     _____  _____  __________

2. Pulled cocking handle to the
rear, locking bolt to the rear.               _____  _____  __________

3. Returned cocking handle forward.           _____  _____  __________

4. Placed safety in SAFE.                     _____  _____  __________

5. Raised cover.                              _____  _____  __________

6. Lifted rear of gun slightly to
observe into chamber.                         _____  _____  __________

7. Positioned belt with double
loop toward gun and split link down.          _____  _____  __________

8. Placed round in feed tray groove.          _____  _____  __________

9. Closed cover.                              _____  _____  __________

10. Lifted up on cover to insure
cover locked.                                 _____  _____  __________

11. Placed safety in FIRE.                    _____  _____  __________

12. Pulled trigger.                           _____  _____  __________

13. Performed steps in sequence.              _____  _____  __________

Seconds to load machinegun: _____


Figure III.6. Infantryman (MOS 11B) Hands-On Performance Test sample
(Page 1 of 2).


PERFORMANCE MEASURES:                            GO     NO-GO    COMMENTS

Immediate Action

INSTRUCTIONS TO SOLDIER: You have been firing the machinegun and the weapon
suddenly stops firing. Apply immediate action. Begin.

14.  Pulled cocking handle to the rear,
     locking bolt to rear. (Round ejects)       _____   _____   __________

15.  Returned cocking handle forward.           _____   _____   __________

16.  Pulled the trigger.                        _____   _____   __________

17.  Performed steps in sequence.               _____   _____   __________

Seconds to perform immediate action: _____

INSTRUCTIONS TO THE SOLDIER: You must now unload and clear the machinegun.
Begin.

18.  Pulled cocking handle to the rear,
     locking bolt to the rear.                  _____   _____   __________

19.  Returned cocking lever forward.            _____   _____   __________

20.  Placed safety in SAFE.                     _____   _____   __________

21.  Opened cover.                              _____   _____   __________

22.  Removed munition belt.                     _____   _____   __________

23.  Lifted rear of gun slightly
     to observe into chamber.                   _____   _____   __________

24.  Closed cover.                              _____   _____   __________

25.  Placed safety in FIRE.                     _____   _____   __________

26.  Pulled cocking handle to rear.             _____   _____   __________

27.  Pulled trigger and eased cocking
     handle forward. (Must hold onto
     handle: bolt must not slam forward.)       _____   _____   __________

28.  Placed safety in SAFE.                     _____   _____   __________

29.  Performed steps in sequence.               _____   _____   __________

Seconds to unload and clear machinegun: _____


Figure III.6. Infantryman (MOS 11B) Hands-On Performance Test sample
(Page 2 of 2).

A prevalent pitfall in performance knowledge test development is a
tendency to cover information about why a step or action is done or to
rely on technical questions about the task or equipment. Just as in
the hands-on tests, the objective of the knowledge test is to measure
the soldier's ability to perform a task. The knowledge or recall
required by the test item must not exceed what is required of a
soldier when he or she is actually performing the task. Because of
this performance requirement, knowledge tests must present
job-relevant stimuli as much as possible, and the liberal use of
quality illustrations is essential.

•  Identify performance errors. Performance-based knowledge tests must
focus on what soldiers do when they fail to perform the task or steps
in the task correctly.

•  Present likely alternatives. The easiest answer to write is the
correct alternative. The approach here focuses on identifying what
it is soldiers do wrong when they perform a step; that is, if they do
not perform the step correctly, what is it that they do perform?
This information becomes the basis for the other alternatives.

Knowledge tests were constructed by project personnel with experience in
test item construction and expertise in the MOS/task being tested. Test
items were reviewed internally by a panel of test experts to ensure
consistency between individual developers. The following general guidelines
were used in construction:

•  Stem length for items was usually restricted to two lines. Where
needed, a "Situation" was separately described if it could be applied
to two or more items.

•  Item stems were designed so that the item could be answered based on
the stem alone, that is, without reference to the alternatives.

•  Illustrations were used where they could duplicate job cues. Where
necessary, illustrations were also used as alternatives or to provide
a job-related reference. All illustrations were drawings.

•  Each task tested was a separate entity, clearly identified by task
title and distinct from other tested tasks.

•  For items that allowed or required use of publications on the job,
abstracts were prepared. If the publication was lengthy (e.g.,
tables of vehicle maintenance checks), the abstract was provided as a
separate handout. Brief abstracts, of one page or less, were
appended to the test. Materials needed in performance knowledge
tests, such as maps, protractors, and scratch paper, were also
provided.

•  Test items within a task test were arranged in the sequence in which
they would normally occur when the soldier performed the task.

•  Completed tests were checked for inter-item cueing.

•  All correct alternatives were authenticated as correct by a citable
reference.

In four of the nine MOS, some of the tasks that incumbents perform are
affected by the type of equipment to which they are assigned. For these MOS
it was necessary to develop separate tracked versions of tests covering the
specific items of equipment involved.


Pilot  Tests  and  Revisions 

Following construction of the tests, arrangements were made through the
Proponent for troop support for a pilot test of the hands-on and knowledge
tests. This procedure was conducted by the test developer and involved the
support of four NCO scorers/SMEs, five MOS incumbents in SL1, and the
equipment dictated by the hands-on test.

Pilot of Hands-On Tests. The following activities were performed:

1.  Test Review - The four NCO scorers independently reviewed the
instructions to scorer and the scoresheets. The developer noted
comments or questions that could be clarified by changes or
additions to the materials.

2.  Test Set-Up - One of the scorers set up the test as directed in the
prepared instructions. The developer noted deficiencies or
necessary changes in the instructions.

3.  Scoring - One of the incumbents performed the test while the four
scorers scored the test independently. After the test, all four
scoresheets were compared. Discrepancies in scoring were discussed
and the reasons ascertained. Some scorer discrepancies were the
result of a scorer's physical position relative to the incumbent,
but many required changing a performance measure or the instructions
to scorers, or even changing the test or performance procedure
itself. If possible, these changes were made before the next
incumbent was tested. Normally, variations in incumbent
performances occur naturally, but to ensure variation the developer could
cue incorrect performance without the scorers' knowledge. Testing
continued with the other incumbents, followed by scoresheet review
and revision. The incumbents were included in the review process to
assist in determining how they actually performed.

4.  Examinee Debriefings - Incumbents were interviewed to determine
whether the instructions provided them adequate guidance on what
they were expected to do.

5.  Time Data - Performance times were kept on all incumbents, as were
station and test set-up times.
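
The scoresheet comparison described in the Scoring step can be illustrated
with a short sketch. It is an illustration only, not the tooling used in the
pilot tests: the function name find_discrepancies, the scorer labels, and the
sample GO/NO-GO marks are all hypothetical. The sketch flags each performance
measure on which the four independent scorers did not give the same mark, so
that those measures can be discussed with the scorers and the incumbent.

```python
def find_discrepancies(scoresheets):
    """Return the performance measures on which the scorers did not all agree.

    scoresheets maps a scorer label to a list of "GO"/"NO-GO" marks,
    one mark per performance measure, in scoresheet order.
    """
    flagged = {}
    marks_by_measure = zip(*scoresheets.values())
    for number, marks in enumerate(marks_by_measure, start=1):
        if len(set(marks)) > 1:  # at least one scorer marked it differently
            flagged[number] = dict(zip(scoresheets.keys(), marks))
    return flagged


# Hypothetical marks from four scorers observing one incumbent on five measures.
sheets = {
    "scorer_1": ["GO", "GO", "NO-GO", "GO", "GO"],
    "scorer_2": ["GO", "GO", "NO-GO", "GO", "NO-GO"],
    "scorer_3": ["GO", "GO", "GO", "GO", "GO"],
    "scorer_4": ["GO", "GO", "NO-GO", "GO", "GO"],
}
discrepancies = find_discrepancies(sheets)
for measure, marks in discrepancies.items():
    print(f"Measure {measure}: {marks}")
```

A flagged measure does not by itself say what is wrong. As noted above, the
cause may be a scorer's viewing position, the wording of the measure, or the
test procedure itself.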


Based on the pilot test information, a final version of each hands-on
test was prepared. These tests are contained in Appendix G, ARI Research
Note in preparation.

Pilot of Knowledge Tests. The knowledge tests were pilot tested at the
same time as the hands-on measures. The same four NCO hands-on scorers and
five MOS incumbents were utilized, but the procedure was different for the two
groups.

1.  NCO SME - The test developer went through each test, item by item,
with all four NCOs simultaneously. The specific questions addressed
were:

•  Would the SL1 soldier be expected to perform this step, make this
decision, or possess this knowledge in the performance of this
task on the job?

•  Is the keyed alternative correct?

•  Are the "incorrect" alternatives actually incorrect?

•  Is there anything in local or unit SOP that would affect the way
this task item is performed?

•  Are the illustrations clear, necessary, and sufficient?

•  Is there any aspect of this task that is not covered in the test
but should be covered?

2.  Incumbents - The five incumbents took the test as actual examinees.
They were briefed as to the purpose of the pilot test and told to
attempt to answer all items. The tests were administered by task
and the time to complete each task test was recorded individually.
After each task test the incumbents were debriefed. The following
questions were addressed:

•  Were there any items where you did not understand what you were
supposed to do or answer?

•  Were there any illustrations that you did not understand?

•  (Item by item) This is what is supposed to be the correct answer
for item ___. Regardless of how you answered it, do you agree
or disagree that this choice should be correct?

Revisions based on SME and incumbent inputs were made to the tests. On
the basis of the times obtained for the incumbents, the tests were divided
into four sections or booklets of approximately equal lengths for Batch A
MOS; for Batch B, tests were divided into two booklets. The purpose of
dividing tests into several booklets and varying the order of administration
among groups of soldiers was to distribute fatigue effects. These revised
versions of the tests are contained in Appendix H, in ARI Research Note in
preparation.
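
One way to divide task tests into booklets of approximately equal length is
sketched below. The report does not state how the division was carried out;
the greedy rule used here (assign the longest remaining task test to the
booklet with the least accumulated time) is an assumption chosen for
illustration, and the function name, task labels, and completion times are
hypothetical.

```python
import heapq


def split_into_booklets(task_minutes, n_booklets):
    """Greedily assign task tests to booklets to balance total testing time.

    task_minutes maps a task label to its pilot-test completion time.
    Returns (booklets, totals): the task labels in each booklet and the
    summed time of each booklet.
    """
    # Min-heap of (accumulated minutes, booklet index).
    heap = [(0, i) for i in range(n_booklets)]
    heapq.heapify(heap)
    booklets = [[] for _ in range(n_booklets)]
    # Longest task tests first, so the short ones can even out the totals.
    for task, minutes in sorted(task_minutes.items(), key=lambda kv: -kv[1]):
        total, i = heapq.heappop(heap)
        booklets[i].append(task)
        heapq.heappush(heap, (total + minutes, i))
    totals = [sum(task_minutes[t] for t in b) for b in booklets]
    return booklets, totals


# Hypothetical mean completion times (minutes) for eight task tests.
times = {
    "task_01": 12, "task_02": 9, "task_03": 7, "task_04": 7,
    "task_05": 6, "task_06": 5, "task_07": 4, "task_08": 2,
}
booklets, totals = split_into_booklets(times, 4)
print(booklets)
print(totals)
```

Balancing booklet length is only half of the design described above. The
order in which booklets are administered would still need to be varied across
groups of soldiers to distribute fatigue effects.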
Construction of Auxiliary Instruments


Task-Specific Performance Rating Scales. Development of hands-on and
knowledge tests provided two methods of measuring the sample of 15 tasks. As
a third method, the soldier's peers and supervisors were asked to rate the
soldier's performance on those same 15 tasks by means of a 7-point numerical
rating scale. The intent was to assess performance on the same set of 15
tasks with three different methods. The rating scales were developed for
administration during the field tests.

Job History Questionnaire. Although soldiers in a given MOS share a
common pool of potential tasks, their actual task experience may vary
substantially. The most widespread reason for this difference is assignment
options in the MOS. The options may be formal, such as when an SL1 Armor
Crewman may be a driver or a loader on the tank, or they may be informal
specializations, such as when one Administrative Specialist types orders
while another types correspondence. A more extreme reason for task
difference in an MOS occurs when soldiers are assigned to duties not
typically associated with their MOS. For example, an Armor Crewman may be
assigned to drive a 1/4-ton truck, or a Medical Specialist may perform
clerical tasks. Such soldiers are not given Special or Additional Skill
Identifiers, nor are they considered to be working in a secondary MOS: They
are simply tankers who drive trucks or medics who type and file. The
likelihood of differences in task experience is further increased by
differences in unit training emphasis where training schedules at battalion,
company, and platoon level emphasize different tasks. As a result of these
circumstances, soldiers' experiences vary, even within an MOS and location.

Given that the central thrust of Project A is the validation of
selection and classification measures, any differential task experience that
affects performance is a contaminating variable. That is, if the differences
in task experiences of sampled soldiers are wide enough to have an impact on
task performance, experience effects may also be strong enough to mask
predictor relationships with performance. In this case, measures of
experience would need to be incorporated into validation analyses so that
predictor-criterion relationships could be assessed independent of
experience.
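
The masking effect described above can be made concrete with a first-order
partial correlation, which is one standard way to assess a
predictor-criterion relationship with experience held constant. The report
does not specify the analysis actually used, so the sketch below is an
assumption for illustration; the function names and the four-soldier data set
are hypothetical and were constructed so that the criterion equals the
predictor plus an unrelated experience effect.

```python
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def partial_correlation(predictor, criterion, experience):
    """Predictor-criterion correlation with experience partialled out."""
    r_pc = pearson(predictor, criterion)
    r_pe = pearson(predictor, experience)
    r_ce = pearson(criterion, experience)
    return (r_pc - r_pe * r_ce) / sqrt((1 - r_pe ** 2) * (1 - r_ce ** 2))


# Hypothetical scores: experience is unrelated to the predictor but adds
# directly to task performance, which lowers the zero-order validity.
predictor = [1, 2, 3, 4]
experience = [1, -1, -1, 1]
criterion = [p + e for p, e in zip(predictor, experience)]

zero_order = pearson(predictor, criterion)
adjusted = partial_correlation(predictor, criterion, experience)
print(round(zero_order, 3), round(adjusted, 3))
```

In this constructed case the zero-order correlation understates the
predictor-criterion relationship, and partialling out experience recovers it.
Real data would not separate so cleanly.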

To assess the likely impact of experience effects on task performance,
and consequently on the Concurrent Validation strategies, a Job History
Questionnaire was developed to be administered to each soldier. Specifically,
soldiers were asked to indicate how recently and how frequently (in the
preceding 6 months) they had performed each of the 30 tasks selected as
performance criteria. A copy of the questionnaire for the MOS 13B Cannon
Crewman is included as Appendix J, ARI Research Note in preparation.


Field  Test  Instruments 


At this point the initial versions of the hands-on job sample tests and
the multiple-choice knowledge tests had been developed, pilot tested, and
revised. The 7-point task performance rating scales and the Job History
Questionnaire had been constructed. These instruments were included in the
complete criterion array and field tested on samples of approximately 150 job
incumbents from each of the Batch A and Batch B MOS.

The field test procedures for the MOS-specific task performance measures
are described in Section 8, and the field test results and subsequent
modifications of the various measures are described in Section 10.


Section  4 


DEVELOPMENT OF MOS-SPECIFIC BEHAVIORALLY
ANCHORED RATING SCALES (BARS)1


A major component of Project A criterion development is devoted to using
the critical incident method to identify the basic set of performance factors
that describe total job performance. Total performance has been
conceptualized as composed of two types of factors: those that have the same
meaning and interpretation across jobs and those that are specific to a
particular job, that is, specific to the job's content. This chapter deals
with the job-specific factors and the rating scales developed to measure
them.


The procedure used to identify MOS-specific job factors was derived in
large part from procedures outlined by Smith and Kendall (1963) and by
Campbell, Dunnette, Arvey, and Hellervik (1973). Smith and Kendall
recommended conducting critical incident workshops that involve, as a first
step, naming and defining the major components of performance for the job in
question. Workshop participants are then asked to write samples of effective
and ineffective performance for each of the major components they have
identified.

Campbell et al. suggested a slight modification to the Smith and Kendall
procedure, recommending that performance categories be generated after
participants have had an opportunity to write several incidents. In this
way, participants are not constrained by a priori performance categories and
are more likely to write performance examples that represent all job
requirements.

In this procedure the next step involves editing the written critical
performance incidents. These edited incidents are then used to identify the
major dimensions of the job by asking supervisors and incumbents to read the
performance incidents and make two ratings for each. Raters assign each
incident to a performance dimension and then indicate the level of
performance on the dimension represented by that incident. The final product
is a set of behaviorally defined and anchored performance dimensions that
focus on the duties and standards of a specific job or MOS.

The purpose of this part of Project A is to develop behaviorally
anchored performance rating scales (BARS) that assess job-specific
performance factors for the nine MOS in Batch A and Batch B.


1This section is based primarily on an ARI Technical Report in preparation,
Development and Field Test of Behaviorally Anchored Rating Scales for Nine
MOS, by Jody L. Toquam, Jeffrey J. McHenry, VyVy A. Corpe, Sharon R. Rose,
Steven C. Lammlein, Edward Kemery, Walter C. Borman, Raymond Mendel, and
Michael J. Bosshardt, and a supplementary ARI Research Note, also in
preparation, which contains the report appendixes.

Development Procedure

Each of the nine MOS was assigned to a specific member of the research
staff. This individual assumed responsibility for (a) conducting workshops
to collect performance incidents for the assigned MOS, (b) editing incidents,
(c) preparing retranslation exercises, (d) developing performance rating
scales, and (e) revising the scales for use in the Concurrent Validation
efforts. Thus, a single researcher became an "expert" concerning the job
duties and requirements involved in the assigned MOS.


Workshop  Participants 


Incumbents, or first-term enlistees, from target MOS were not, as a
rule, included in the workshops, because their experience with the job was
relatively limited. Almost all participants were noncommissioned officers
(NCOs) who were directly responsible for supervising first-term enlistees and
who had spent 2 to 4 years as first-termers in these MOS themselves.
Consequently, most workshop participants were familiar with the job
requirements from both an incumbent and a supervisor perspective.

To ensure thorough coverage and representation of the critical behaviors
comprising each MOS, workshops for each MOS were conducted at six CONUS
(Continental United States) Army posts. Each post was asked to designate
from 10 to 16 NCOs for each target MOS. Thus, the goal was to obtain input
from about 60 to 96 supervisors for each MOS. The total number of NCOs
participating in the performance incident workshops by MOS is shown in Table
III.13. The total array of posts at which workshops were held is shown in
Table III.14.


Table III.13

Participants in MOS-Specific BARS Workshops


MOS                                        Number of Participants

Batch A

13B   Cannon Crewman                                 88
64C   Motor Transport Operator                       81
71L   Administrative Specialist                      63
95B   Military Police                                86

Batch B

11B   Infantryman                                    83
19E   Armor Crewman                                  65
31C   Radio Teletype Operator                        60
63B   Light Wheel Vehicle Mechanic                   75
91A   Medical Specialist                             71


Table III.14

Locations and Dates of MOS-Specific BARS Workshops


Location               Dates

Batch A

Fort Ord               25-26 August 1983
Fort Polk              29-30 August 1983
Fort Bragg             12-13 September 1983
Fort Campbell          15-16 September 1983
Fort Hood              13-14 October 1983
Fort Carson            31 October - 1 November 1983

Batch B

Fort Lewis             9-11 January 1984
Fort Stewart           11-13 January 1984
Fort Riley             16-18 January 1984
Fort Bragg             27-29 February 1984
Fort Bliss             12-14 March 1984
Fort Sill              14-16 March 1984


Collection  of  Data  on  Performance  Incidents 

After a workshop group was convened, research staff members serving as
workshop leaders described Project A and briefed participants on the purpose
of the workshop. This led to a discussion of the different types of
performance rating scales available and to a discussion of the advantages of
using behaviorally anchored rating scales to assess job performance. Leaders
then described how the results from the day's activities would be used to
develop this type of rating scale for that particular MOS.

Workshop leaders then provided instruction for writing performance
incidents and distributed performance incident forms. Participants were asked
to generate accounts of performance incidents, using examples provided as
guides. Participants were asked to avoid writing about activities or
behaviors that reflect general soldier effectiveness (e.g., following rules
and regulations, military appearance), as these requirements have been
identified and described in another part of the project.

After about 4-5 hours, performance incident writing was halted and
workshop leaders began generating discussion about the major components or
activities comprising the job. During this discussion, participants were
asked to identify the major job performance categories, which workshop
leaders recorded on a blackboard or flipchart. When participants indicated
that all possible performance categories had been identified, the workshop
leader asked them to review the list and consider whether or not all job
duties were indeed represented. The leader also asked participants to
consider whether each category represented first-term enlistee job
requirements or requirements of more experienced soldiers.

Following this discussion, participants were asked to review the
performance incidents they had written and to assign them to one of the job
categories or dimensions that appeared on the blackboard or flipchart. The
workshop leader tallied the total number of incidents in each category.
Those categories with very few incidents were the focus of the remainder of
the workshop; participants were asked to spend the remaining time generating
performance incidents for those categories.

Results from the performance incident workshops are reported in Table
III.15 for Batch A MOS and in Table III.16 for Batch B MOS. The number of
participants and the number of performance incidents generated are reported
by MOS and by post. The mean numbers of incidents generated, the total
number of participants, and total number of incidents are also reported by
MOS and by post.

The schedule permitted research staff members time to edit and review
performance incidents between data collection activities. For example,
following the data collection activities at Fort Bragg and Fort Campbell,
performance incidents were edited, content analyzed, and sorted into
categories. These categories were then integrated with those generated
during the earlier workshops and discussed with participants in subsequent
workshops held at Fort Hood and Fort Carson.

A similar iterative procedure was used to generate performance
dimensions for the Batch B MOS.


Retranslation  Activities 

A primary objective of the retranslation exercise is to verify that the
performance dimension system provides a thorough and comprehensive coverage
of the critical job requirements. The evidence for this verification is high
agreement among judges that specific incidents represent particular
components (factors) of performance, that all hypothesized factors can be
represented by incidents, and that all incidents in the sample can be
assigned to a factor (if they cannot, factors may be missing).

A second objective involves constructing the performance anchors for
each dimension. Participants are asked to rate the level of performance
described in each incident. These ratings are then used to help construct
behavioral anchors that describe typical performance at different
effectiveness levels within a single performance dimension.

Retranslation procedures employed for Batch A MOS differed from those
for Batch B MOS, as noted in the following description.


Table III.15


BARS Performance Incident Workshops: Number of Participants and
Incidents Generated by MOS and by Location - Batch A


                                        MOS
                            ---------------------------     Total By
Location                     13B    64C    71L    95B       Location

Fort Ord
  N - Participants            14     10      5     14           43
  N - Incidents              195     80     59    213          547
  Mean Per Participant      13.9    8.0   11.8   15.2         12.7

Fort Polk
  N - Participants            12     15     15     15           57
  N - Incidents              150    240    210    235          835
  Mean Per Participant      12.5   16.0   14.0   15.7         14.7

Fort Bragg
  N - Participants            13     14     11     17           55
  N - Incidents              235    221    218    225          899
  Mean Per Participant      18.1   15.8   19.8   13.2         16.4

Fort Campbell
  N - Participants            13     13     10     11           47
  N - Incidents              195    191    154    238          778
  Mean Per Participant      11.5   13.6   17.1   15.9         14.2

Fort Hood
  N - Participants            13     13     10     11           47
  N - Incidents              180    183    133     92          588
  Mean Per Participant      13.9   14.1   13.3    8.4         10.7

Fort Carson
  N - Participants            19     15     13     14           61
  N - Incidents              204    232    215    180          831
  Mean Per Participant      10.7   15.5   16.5   12.9         13.6

Total By MOS
  N - Participants            88     81     63     86          318
  N - Incidents             1159   1147    989   1183         4478
  Mean Per Participant      13.2   14.2   15.7   13.8         14.1
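
The "Mean Per Participant" entries are the number of incidents divided by the
number of participants, rounded to one decimal. The short check below
recomputes the "Total By MOS" row of Table III.15 from the participant and
incident totals reported for the four Batch A MOS. It is offered only as a
worked illustration of the arithmetic; the variable names are hypothetical.

```python
# Participant and incident totals for Batch A, from the Total By MOS rows.
participants = {"13B": 88, "64C": 81, "71L": 63, "95B": 86}
incidents = {"13B": 1159, "64C": 1147, "71L": 989, "95B": 1183}

# Mean incidents per participant for each MOS, rounded as in the table.
means = {mos: round(incidents[mos] / participants[mos], 1)
         for mos in participants}

# Overall mean across all Batch A participants.
total_participants = sum(participants.values())
total_incidents = sum(incidents.values())
overall_mean = round(total_incidents / total_participants, 1)

print(means)
print(total_participants, total_incidents, overall_mean)
```

The same division can be applied to any location row as a consistency check
on the reported means.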

Table III.16


BARS Performance Incident Workshops: Number of Participants and
Incidents Generated by MOS and by Location - Batch B


                                           MOS
                            ----------------------------------     Total By
Location                     11B    19E    31C    63B    91A       Location

Fort Lewis
  N - Participants            16     11      8     10     11           56
  N - Incidents              211    180    124    172    130          817
  Mean Per Participant      18.3   16.4   15.5   17.2   11.8         14.6

Fort Stewart
  N - Participants            14     15     15     16     16           76
  N - Incidents              216    275    256    208    249        1,204
  Mean Per Participant      15.4   18.3   17.1   13.0   15.6         15.8

Fort Riley
  N - Participants            18      7     10     11      8           54
  N - Incidents              216    123    127    133     90          689
  Mean Per Participant      12.0   17.6   12.7   12.1   11.3         13.8

Fort Bragg
  N - Participants            13     14     16     15     13           71
  N - Incidents              231    190    220    250    217        1,108
  Mean Per Participant      17.8   13.6   13.8   16.7   16.7         15.6

Fort Sill(a)
  N - Participants             8      4      3      9     10           34
  N - Incidents               26      0     13     32     20           91
  Mean Per Participant       3.3           4.3    3.6    2.0          2.7

Fort Bliss(a)
  N - Participants            14     14      8     14     13           63
  N - Incidents               93     70     39     71     55          328
  Mean Per Participant       6.6    5.0    4.9    5.1    4.2          5.2

Total By MOS
  N - Participants            83     65     60     75     71          354
  N - Incidents              993    838    779    866    761        4,237
  Mean Per Participant      12.0   12.0   13.0   11.6   10.7         12.0


Retranslation Material and Procedure for Batch A. The Smith and Kendall
(1963) procedure calls for including individuals familiar with the target job
as participants in the retranslation process. For the Batch A MOS, we
planned to include the participants from the earlier workshops in the
retranslation phase; most of these persons were supervisors of the target
incumbents, rather than incumbents. Participants were informed during the
workshops that we would contact them via mail to complete another phase of
the project.

After taking a count of the incidents, we decided that it was
impractical to ask participants to rate all performance incidents generated
for their MOS; the number of incidents per MOS ranged from 761 to 1,183.
Instead, we asked participants to retranslate only a subset of the total
incidents.

Return rates across all Batch A MOS were such that, on the average, only
about 20% of the participants completed the retranslation task. This number
of ratings proved insufficient for analyses. To increase the number of
retranslation ratings, we conducted retranslation workshops at Fort Meade,
Maryland, utilizing NCOs from the four MOS who were familiar with first-term
enlistee job requirements. Further, project staff members from HumRRO who
were familiar with the job requirements of one or more MOS also completed
retranslation booklets.

Procedures for Batch B. Because of the low return rate for Batch A MOS,
the procedures were modified for Batch B. Activities scheduled for the final
two workshops, conducted at Fort Sill and Fort Bliss, varied from those
described previously. At these workshops, participants spent the first 2
hours generating performance incidents describing MOS-specific job behaviors,
then spent the remainder of their day completing retranslation booklets.

Participants were asked to complete as many retranslation booklets as
possible. In general, each individual completed about one-and-a-half to two
booklets. Also during this session, participants were asked to retranslate
the performance incidents generated earlier during that session. Hence, we
obtained retranslation ratings for all performance incidents generated at the
first four workshops as well as retranslations for the new incidents
generated at that particular workshop.


Construction  of  Initial  Rating  Scales 

Table III.17 summarizes the number of ratings obtained from the
retranslation exercise for Batch A and Batch B. The retranslation data were
analyzed separately for each MOS. The process included computing for each
incident (a) the number of raters, (b) percent agreement among raters in
assigning incidents to performance dimensions, (c) mean effectiveness rating,
and (d) standard deviation of the effectiveness ratings. Percent agreement
values, mean effectiveness ratings, and standard deviations are provided for
all performance incidents in Section 3 of the MOS appendixes in the ARI
Research Note in preparation.
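
The four per-incident statistics listed above can be sketched as follows.
This is an illustration under stated assumptions rather than the analysis
code used in the project: percent agreement is taken to be the share of
raters who chose the most frequently chosen dimension, the standard deviation
is the sample (n - 1) form, and the function name, dimension labels, and
ratings are hypothetical.

```python
from collections import Counter
from statistics import mean, stdev


def summarize_incident(ratings):
    """Summarize retranslation ratings for one performance incident.

    ratings is a list of (dimension, effectiveness) pairs, one per rater.
    At least two raters are required for the standard deviation.
    """
    dimension_counts = Counter(dimension for dimension, _ in ratings)
    modal_dimension, modal_count = dimension_counts.most_common(1)[0]
    scores = [score for _, score in ratings]
    return {
        "n_raters": len(ratings),
        "modal_dimension": modal_dimension,
        # Assumed definition: percent of raters choosing the modal dimension.
        "percent_agreement": 100.0 * modal_count / len(ratings),
        "mean_effectiveness": mean(scores),
        # Assumed form: sample standard deviation (n - 1 denominator).
        "sd_effectiveness": stdev(scores),
    }


# Hypothetical ratings of one incident by four raters.
example = [("A", 7), ("A", 8), ("A", 6), ("B", 7)]
summary = summarize_incident(example)
print(summary)
```

Whether the original analysis used the sample or the population standard
deviation is not stated; with the numbers of raters shown in Table III.17 the
two would differ only slightly.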
Table III.17


BARS Retranslation Exercise: Number of Forms Developed for
Each MOS and Average Number of Raters Completing Each Form


                           Number of
             Number        Incidents/Form         Average Number
MOS          of Forms      Average    Total       of Raters/Form

Batch A

13B             4            171        684            17.0
64C             5            191        955            12.6
71L             4            190        760            14.0
95B             5            229       1145             7.6

Batch B

11B             2            274        548            19.0
19E             3            201        603             9.7
31C             3            235        705             9.0
63B             3            230        690            16.0
91A             3            210        630            17.7

The next step in the process involved identifying those performance
incidents on which raters agreed reasonably well on performance dimension
assignment and effectiveness level. For each MOS, we identified performance
incidents that met the following criteria: (a) at least 50% of the raters
agreed that the incident depicted performance in a single performance
dimension, and (b) the standard deviation of the effectiveness ratings
did not exceed 2.0. These incidents were then sorted into their assigned
performance dimensions. Results from this sorting are presented for each MOS
in Table III.18.
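
The two retention criteria and the subsequent sort can be illustrated with a short sketch. The incident records below are invented for illustration, and the names `is_retained` and `by_dimension` are hypothetical; only the two thresholds (at least 50% agreement, standard deviation not exceeding 2.0) come from the text.

```python
def is_retained(pct_agreement, sd_effectiveness, min_agreement=50.0, max_sd=2.0):
    """Apply the two retranslation criteria described for the MOS-specific BARS."""
    return pct_agreement >= min_agreement and sd_effectiveness <= max_sd


# Hypothetical incidents: dimension is the one most raters assigned.
incidents = [
    {"id": 1, "dimension": "A", "pct_agreement": 75.0, "sd": 1.3},
    {"id": 2, "dimension": "B", "pct_agreement": 40.0, "sd": 1.0},  # low agreement
    {"id": 3, "dimension": "A", "pct_agreement": 60.0, "sd": 2.4},  # SD too large
    {"id": 4, "dimension": "B", "pct_agreement": 50.0, "sd": 2.0},  # exactly at both limits
]

retained = [i for i in incidents if is_retained(i["pct_agreement"], i["sd"])]

# Sort the retained incidents into their assigned performance dimensions.
by_dimension = {}
for incident in retained:
    by_dimension.setdefault(incident["dimension"], []).append(incident["id"])

print(by_dimension)
```

Incident 4 is kept because both criteria are stated inclusively ("at least" and "did not exceed").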

After the incidents had been sorted into performance dimensions, the
percentage agreement values were examined to identify dimensions that raters
found confusing or difficult to distinguish from one another. On the basis
of these data, some dimensions were dropped and some were collapsed.

After modifying the dimension system using results from the retranslation
exercise, we developed behavioral anchors for each dimension. This
involved sorting the incidents into effective performance (mean values of 6.5
or higher), average performance (mean values of 3.5 to 6.4), and ineffective
performance (mean values from 1.0 to 3.4). We reviewed the content of the
incidents in each of these three areas and then summarized the information in
each to form three behavioral anchors depicting effective, average, and
ineffective performance (see example in Figure III.7).
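
The three-way sort by mean effectiveness can be written as a small function. This is a sketch under stated assumptions: the function name `effectiveness_level` is hypothetical, the cut points 3.5 and 6.5 are treated as exact boundaries (the text gives ranges to one decimal place), and the upper limit of 9 is assumed from the 1-9 effectiveness scale described for the Army-wide retranslation.

```python
def effectiveness_level(mean_rating, scale_min=1.0, scale_max=9.0):
    """Assign an incident to an anchor level from its mean effectiveness rating."""
    if not scale_min <= mean_rating <= scale_max:
        raise ValueError(f"mean rating {mean_rating} is outside the assumed scale")
    if mean_rating >= 6.5:
        return "effective"
    if mean_rating >= 3.5:
        return "average"
    return "ineffective"


# Hypothetical mean ratings for five incidents.
for value in (7.2, 6.5, 5.0, 3.4, 1.0):
    print(value, effectiveness_level(value))
```

The incidents grouped at each level would then be read together and summarized by hand into a single behavioral anchor; that summarizing step is a judgment task and is not something the code attempts.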


Table III.18

Behavioral Examples Reliably Retranslated Into Each
Dimension on the BARS Measures

                                                            Number of
Dimension                                                   Examples

Cannon Crewman (13B)
  A. Loading out equipment                                      49
  B. Driving and maintaining vehicles, howitzers,
     and equipment                                             195
  C. Transporting/sorting/storing and preparing
     ammunition for fire                                       108
  D. Preparing for occupation and emplacing howitzer            44
  E. Setting up communications                                  24
  F. Gunnery                                                    99
  G. Loading/unloading howitzer                                 32
  H. Receiving and relaying communications                      19
  I. Recording/record keeping                                   29
  J. Position improvement                                       14

Military Police (95B)
  A. Traffic control and enforcement on post and
     in the field                                               63
  B. Providing escort security and physical security           128
  C. Making arrests, gathering information on criminal
     activity, and reporting on crimes                         173
  D. Patrolling and crime/accident prevention activities       236
  E. Promoting confidence in the military police by
     maintaining personal and legal standards and
     through community service work                            118
  F. Using interpersonal communication (IPC) skills             87
  G. Responding to medical emergencies and other
     emergencies of a noncriminal nature                        50
     Total                                                     855

Motor Transport Operator (64C)
  A. Driving vehicles                                          158
  B. Vehicle coupling                                           46
  C. Checking and maintaining vehicles                         181
  D. Using maps/following proper routes                         27
  E. Loading cargo and transporting personnel                   75
  F. Parking and securing vehicles                              32
  G. Performing administrative duties                           42
  H. Self-recovering vehicles                                   20
  I. Safety-mindedness                                          80
  J. Performing dispatcher duties                               15
     Total                                                     676

Infantryman (11B)
  A. Ensuring that all supplies and equipment are
     field-ready and available and well-maintained
     in the field                                               73
  B. Providing leadership and/or taking charge in
     combat situations                                          33
  C. Navigating and surviving in the field                      53
  D. Using weapons safely                                       38
  E. Demonstrating proficiency in the use of all
     weapons, armaments, equipment and supplies                 91
  F. Maintaining sanitary conditions, personal hygiene,
     and personal safety in the field                           24
  G. Preparing a fighting position                              29
  H. Avoiding enemy detection during movement and in
     established defensive positions                            22
  I. Operating a radio                                          27
  J. Performing reconnaissance and patrol activities            37
  K. Performing guard and security duties                       75
  L. Demonstrating courage and proficiency in engaging
     the enemy                                                   5
  M. Guarding and processing POWs and enemy casualties          15
     Total                                                     57?

Administrative Specialist (71L)
  A. Preparing, typing, and proofreading documents             183
  B. Distributing and dispatching incoming/outgoing
     documents                                                  63
  C. Maintaining office resources                               73
  D. Posting regulations                                        44
  E. Establishing and/or maintaining files IAW TAFFS            50
  F. Keeping records                                            94
  G. Safeguarding and monitoring security of classified
     materials                                                  43
  H. Providing customer service                                 30
  I. Preparing special reports, documents, drafts, and
     other materials                                            19
  J. Sorting, routing and distributing incoming/outgoing
     mail                                                       28
  K. Maintaining Army Post Office equipment                      2
  L. Keeping Post Office records                                20
  M. Maintaining security of mail                                9
     Total                                                    5573

Armor Crewman (19E)
     Maintaining tank hull/suspension system and
     associated equipment
     Maintaining tank turret system/fire control system
     Driving/recovering tanks
     Stowing and handling ammunition
     Loading/unloading guns
     Maintaining guns
     Engaging targets with tank guns
     Operating and maintaining communication equipment
     Establishing security in the field
     Navigating
     Preparing/securing tank

Light-Wheel Vehicle Mechanic (63B)
     Inspecting, testing, and detecting problems with
     equipment
     Troubleshooting
     Performing routine maintenance
     Repair
     Using tools and test equipment
     Using technical documentation
     Vehicle and equipment operation
     Recovery
     Planning/organizing jobs
     Administrative duties
     Safety mindedness

Medical Specialist (91A)
  A. Maintaining and operating Army vehicles                    51
  B. Maintaining accountability of medical supplies
     and equipment                                              28
  C. Keeping medical records                                    31
  D. Attending to patients' concerns                            15
  E. Providing accurate diagnoses in a clinic, hospital,
     or field setting                                           11
  F. Arranging for transportation and/or transporting
     injured personnel                                          44
  G. Dispensing medications                                     42
  H. Preparing and inspecting field site or clinic
     facilities in the field                                    34
  I. Providing routine and ongoing patient care                 95
  J. Responding to emergency situations                        142
  K. Providing instruction to Army personnel                    18
     Total                                                     TIT

Radio Teletype Operator (31C)
     Inspecting equipment and troubleshooting problems          50
     Pulling preventative maintenance and servicing
     equipment                                                  79
     Installing and preparing equipment for operation          162
     Operating communications devices and providing for
     an accurate and timely flow of information                142
     Preparing reports                                          33
     Maintaining security of equipment and information          57
     Locating and providing safe transport of equipment
     to sites                                                   50
     Total                                                       m


A.  TRAFFIC CONTROL AND ENFORCEMENT
Controlling traffic and enforcing traffic laws and parking rules.


•  Often uses hand/arm signals that are difficult to understand, at times
   resulting in unnecessary accidents; often fails to wear reflectorized
   gear; overlooks hazardous traffic conditions; sleeps on duty; pays
   excessive attention to things unrelated to the job.

•  May display excess leniency or harshness when citing offenders, allowing
   their military rank, race, and/or sex to influence his/her actions; makes
   many errors when filling out citations.


•  Usually does a reasonable job when directing traffic by using adequate
   hand/arm signals and/or wearing reflectorized gear.

•  Makes few errors when filling out citations; usually does not allow an
   offender’s race, sex, and/or military rank to interfere with good
   judgment.


•  Consistently uses appropriate hand/arm signals; always wears
   reflectorized gear; generally monitors traffic from plain-view vantage
   points; consistently refrains from behaviors such as reading and
   prolonged conversation on non-job related topics.

•  Always uses emergency equipment (e.g., flares, barricades) to highlight
   unsafe conditions and ensures that hazards are removed or otherwise
   taken care of.


Figure III.7.  Sample Behavioral Summary Rating Scale for
Military Police (95B).


It is important to note that for each MOS we developed Behavioral
Summary Scales. Traditional behaviorally anchored rating scales contain
specific examples of job behaviors for each effectiveness level in a
performance dimension. Behavioral Summary Scales, on the other hand, contain
anchors that represent the behavioral content of ALL performance incidents
reliably retranslated for that particular level of effectiveness. This makes
it more likely that a rater using the scales will be able to match observed
performance with performance on the rating scale (Borman, 1979).

After developing the performance rating scales for each MOS, we
submitted the scales to review by a project research staff member familiar
with the development process. Results from this review were used to clarify
performance definitions and behavioral anchors. The final set of performance
rating scales administered in field test sessions is included in Section 4
of the MOS appendixes in the ARI Research Note in preparation.


Revisions After Retranslation

The categorization of the original critical incident pool produced a
total of 93 initial performance dimensions with a range of 7-13 dimensions
per MOS. Based on the retranslation results, a number of the original
performance dimensions were redefined, omitted, or combined. From the
original set, six were omitted and four were lost through combination. One
dimension was omitted because too few critical incidents were
retranslated into it by the judges. The other five were omitted because the
factor represented tasks that were well beyond Skill Level 1 or were from a
very specialized low-density "track" within the MOS (e.g., MOS 71L F5-Postal
Clerk).


Field Test Versions of MOS-Specific BARS

In sum, the results from the retranslation exercises were used to
evaluate and modify the performance dimension system that had been developed for
each MOS. The final set of behaviorally anchored rating scales for the nine
MOS for use in the field test contained from 6 to 12 performance dimensions.
Each of the performance dimensions includes behavioral anchors describing
ineffective, average, and effective performance. Raters were asked to use
these anchors to evaluate ratees on a scale ranging from 1 (ineffective
performance) to 7 (effective performance).

Before the rating scales were tried out in the field, one additional
scale was constructed for each MOS rating booklet. On this scale raters are
asked to evaluate an incumbent's overall performance across all MOS-specific
performance dimensions. This final rating scale is virtually the same for
all MOS; it includes three anchors depicting ineffective, average, and
effective performance.

Rating scale booklets that provided raters with performance dimension
titles, definitions, and behavioral anchors were assembled for each MOS. The
rating booklets were designed so that raters could evaluate up to five ratees
in each. The booklets do not include instructions for using the scales to
make performance ratings; instead, oral instructions were given during the
field test rating sessions.

The field test samples and procedures are described in Section 8. Field
test results and subsequent modifications of the BARS are described in
Section 11.

Section 5

DEVELOPMENT OF ARMY-WIDE RATING SCALES 1

The principal objective for this part of Project A's criterion
development work is to construct a set of critical incident-based rating
scales that will assess the major performance factors in the Army-wide, or
non-job-specific, portion of the total performance space. Another objective
is to develop rating scales that focus on specific common tasks that all
first-term soldiers are required to perform. The procedures for developing
each of these two kinds of rating scales will be described in turn.


Development  of  Army-Wide  Behavior  Rating  Scales 

The development of the Army-wide behavior rating scales followed the
same general procedure as for the MOS-specific BARS (described in Section 4),
and those details will not be repeated here. Presented below are the
procedures and findings that are specific to the Army-wide scales.


Behavior  Analysis  Workshops  and  Procedures 

Seventy-seven officers and NCOs participated in six 1-day workshops
intended primarily to elicit behavioral examples of soldier effectiveness
that were not MOS-specific. Table III.19 describes the workshop participant
groups. A total of 1,315 behavioral examples were generated in the six
workshops. Details relevant to this data collection appear in Table III.20.

Duplicate  examples  and  examples  that  did  not  meet  the  criteria  specified 
(e.g.,  the  incident  described  the  behavior  of  an  NCO  rather  than  a  first-term 
soldier)  were  dropped  from  further  consideration.  The  remaining  1,111 
examples  were  edited  to  a  common  format  and  content  analyzed  by  project  staff 
to  form  preliminary  dimensions  of  soldier  effectiveness.  Specifically,  three 
researchers independently read each example and grouped together those
examples  that  described  similar  behaviors.  The  sorted  examples  were  then 
reviewed  and  the  groupings  were  revised  until  each  author  arrived  at  a  set  of 
dimensions  that  were  homogeneous  with  respect  to  their  content. 

After  discussion  among  project  staff  and  with  a  small  group  of  officers 
and  NCOs  at  Fort  Benning,  a  consensus  was  reached  on  a  set  of  13  dimensions. 
These  were  then  submitted  to  retranslation. 


Retranslation  of  the  Behavioral  Examples 

The retranslation task was divided into five parts, with each part
requiring a judge to evaluate 216-225 behavioral examples. Judges were
provided with definitions of each of 13 dimensions to aid in the sorting, and
with a 1-9 effectiveness scale (1 = extremely ineffective; 5 = adequate/
average; 9 = extremely effective) to guide the effectiveness ratings. The
retranslation materials, including all 1,111 edited behavioral examples,
appear in Appendix B, ARI Research Note in preparation. Sixty-one officer
and NCO judges completed retranslation ratings.


1 This section is based primarily on ARI Technical Report 716, Development
and Field Test of Army-Wide Rating Scales and the Rater Orientation and
Training Program, Elaine D. Pulakos and Walter C. Borman (Eds.), and the
supplementary ARI Research Note 87-22, which contains the report appendixes.


Table III.19

Participants in Behavioral Analysis Workshops for Army-Wide Rating Scales


Table III.20

Soldier Effectiveness Examples Generated for Army-Wide Behavior Rating Scales

                                 Number of      Mean Number of Examples
Location       Participants      Examples       Per Participant


Retranslation Results

Table III.21 shows the number of behavioral examples reliably
retranslated for each of the 13 dimensions. The criteria established for
acceptance (greater than 50% agreement for the sorting of an incident into a
single dimension, and a standard deviation of less than 2.0 for the
distribution of judges' effectiveness ratings for one incident) left 870 of
the 1,111 examples (78%) included for subsequent scale development work.

The results in Table III.21 were seen as satisfactory, in that
sufficient numbers of reliably retranslated examples were available to
develop behavioral definitions of each dimension. Two pairs of dimensions
were combined, resulting in a total of 11 Army-wide dimensions. Leading
Other Soldiers and Supporting Other Unit Members were combined to form
Leading/Supporting; Attending to Detail and Maintaining Own Equipment were
collapsed to form Maintaining Assigned Equipment. The two combinations
seemed appropriate because of the conceptual similarity of each of the
dimension pairs.
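
The arithmetic behind these figures can be checked with a short sketch. The per-dimension counts are those printed in Table III.21, keyed by the table's dimension letters; the helper name `combine` is hypothetical, and the combined counts it produces are not reported in the text. They rest on the assumption that every retained example simply carries over into the merged dimension.

```python
# Counts of reliably retranslated examples, by dimension letter (Table III.21).
counts = {
    "A": 107, "B": 158, "C": 53, "D": 34, "E": 36, "F": 46, "G": 23,
    "H": 47, "I": 131, "J": 59, "K": 40, "L": 71, "M": 65,
}
TOTAL_EDITED_EXAMPLES = 1111

# Pairs combined after retranslation, using the names given in the text.
merges = {
    "Maintaining Assigned Equipment": ("F", "J"),
    "Leading/Supporting": ("L", "M"),
}


def combine(counts, merges):
    """Return a new count table with each merged pair replaced by its sum."""
    merged_letters = {letter for pair in merges.values() for letter in pair}
    combined = {k: v for k, v in counts.items() if k not in merged_letters}
    for name, pair in merges.items():
        combined[name] = sum(counts[letter] for letter in pair)
    return combined


combined = combine(counts, merges)
retained = sum(counts.values())
retention_pct = 100.0 * retained / TOTAL_EDITED_EXAMPLES
print(len(combined), retained, round(retention_pct))
```

Under that assumption the 13 original dimensions reduce to 11, and the retained examples still total 870, or about 78% of the 1,111 edited examples.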

For each of the 11 dimensions, the reliably retranslated behavioral
examples were then divided into three categories of effectiveness levels, and
behavioral summary statements were written to capture the content of the
specific examples at low (1-3.49), average (3.5-6.49), and high (6.5-9)
performance levels. Development of the behavioral summary statements is the
critical step in forming Behavior Summary Scales (Borman, 1979).


Additional  Army-Wide  Scales 

In  addition  to  the  11  Army-wide  BARS,  two  summary  rating  scales  were 
prepared.  First,  an  overall  effectiveness  scale  was  developed  to  obtain 
overall  judgments  of  a  soldier's  effectiveness  based  on  all  of  the  behavioral 
dimension  ratings.  Second,  an  NCO  potential  scale  was  developed  to  assess 
each  soldier's  likelihood  of  being  an  effective  supervisor  as  an  NCO. 


Final List of Army-Wide Behavioral Rating Scales

The  11  Army-wide  BARS  that  were  retained  plus  the  overall  performance 
and  NCO  potential  scales  provided  the  following  behavioral  rating  scales  for 
the  field  test: 

A.  Technical Knowledge/Skill
B.  Effort
C.  Following Regulations and Orders
D.  Integrity
E.  Leadership
F.  Maintaining Assigned Equipment
G.  Maintaining Living/Work Areas
H.  Military Appearance
I.  Physical Fitness
J.  Self-Development
K.  Self-Control
Overall Effectiveness
NCO Potential


Table III.21

Behavioral Examples Reliably Retranslated(a) Into Each Dimension
for Army-Wide Behavior Rating Scales

                                                            Number of
Dimensions                                                  Examples

A.  Controlling own behavior related to personal
    finances, drugs/alcohol, and aggressive acts               107
B.  Adhering to regulations and SOP, and displaying
    respect for authority                                      158
C.  Displaying honesty and integrity                            53
D.  Maintaining proper military appearance                      34
E.  Maintaining proper physical fitness                         36
F.  Maintaining own equipment(b)                                46
G.  Maintaining living and work areas to Army/unit
    standards                                                   23
H.  Exhibiting technical knowledge and skill                    47
I.  Showing initiative and extra effort on job/
    mission/assignment                                         131
J.  Attending to detail on jobs/assignments/
    equipment checks(b)                                         59
K.  Developing own job and soldiering skills                    40
L.  Effectively leading and providing motivation
    to other soldiers(c)                                        71
M.  Supporting other unit members(c)                            65
    Total                                                      870

(a) Examples were retained if they were sorted into a single dimension by
greater than 50% of the retranslation raters and had standard deviations of
their effectiveness ratings of less than 2.0.

(b) These two dimensions were subsequently combined to form a Maintaining
Assigned Equipment dimension.

(c) These two dimensions were subsequently combined to form a Leadership
dimension.


Development  of  Army-Wide  Common  Task  Dimensions 

Rating scales covering the common task domain were developed from tasks
appearing in the Skill Level 1 Common Task Soldier's Manual. Because this
manual specifies tasks that all first-term soldiers are expected to be able
to perform, it seemed an appropriate source of Army-wide common task
dimensions.

To develop these dimensions, a senior staff member content analyzed the
specific tasks contained in the manual (e.g., Read and Report Total Radiation
Dose; Repair Field Wire) and identified 13 common task areas that appeared to
reflect in summary form all of the specific tasks. Examples of common task
areas are See: Estimating Range and Combat Techniques: Moving Under Direct
Fire.


Ratings consisted of evaluating how well each ratee typically performed
each task on a 7-point scale, from 1 = "Poor: does not meet standards and
expectations for adequate performance in this task area" to 7 = "Excellent:
exceeds standards and expectations for performance in this task area." In
addition, raters were given the option of choosing a "0," indicating that
they had not observed a soldier performing in the task area. The 13 common
task dimensions are:

A.  See:  Identifying  Threat  (armored  vehicles,  aircraft) 

B.  See:  Estimating  Range 

C.  Communicate:  Send  a  Radio  Message 

D.  Navigate:  Using a Map

E.  Navigate:  Navigating in the Field

F.  Shoot:  Performing Operator Maintenance on Weapon (e.g., M16 rifle)

G.  Shoot:  Engaging Target With Weapon (e.g., M16)

H.  Combat  Techniques:  Moving  Under  Direct  Fire 

I.  Combat  Techniques:  Clearing  Fields  of  Fire 

J.  Combat  Techniques:  Camouflaging  Self  and  Equipment 

K.  Survive:  Protecting  Against  NBC  Attack 

L.  Survive:  Performing  First  Aid  on  Self  and  Other  Casualties 

M.  Survive:  Knowing  and  Applying  the  Customs  and  Laws  of  War 
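
Because the scale reserves a separate code for "not observed," any summary of these ratings has to set that code aside rather than treat it as a low score. The sketch below shows one way to do this. It assumes the not-observed code is the value 0; the function name `mean_task_rating` and the sample ratings are hypothetical, and the report does not say how these ratings were actually aggregated.

```python
from statistics import mean

NOT_OBSERVED = 0  # assumed code for "rater has not observed this task area"


def mean_task_rating(ratings):
    """Average the 1-7 ratings for one ratee on one task area.

    Ratings equal to NOT_OBSERVED are excluded. Returns None when no rater
    observed the soldier in this task area.
    """
    observed = [r for r in ratings if r != NOT_OBSERVED]
    for r in observed:
        if not 1 <= r <= 7:
            raise ValueError(f"rating {r} is outside the 1-7 scale")
    return mean(observed) if observed else None


# Hypothetical ratings from four raters; the second rater had not observed the task.
print(mean_task_rating([5, 0, 7, 6]))
print(mean_task_rating([0, 0]))
```

Returning None rather than 0 for an unobserved task area keeps "no information" distinct from "poor performance" in any later analysis.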

Field Test Instruments

On the basis of the above development steps, the Army-wide BARS scales
and the Common Task Rating Scales were deemed ready for field testing in the
Batch A and Batch B MOS. Field test procedures are described in Section 8,
and field test results and subsequent modifications to the instruments are
described in Section 12.

Section 6

DEVELOPMENT OF THE COMBAT PERFORMANCE PREDICTION RATING SCALE*


This section describes the development of a combat performance
prediction scale, designed to evaluate performance under degraded conditions
and the increased confusion, workload, and uncertainty of a combat
environment. Such conditions would be expected for many soldiers near a battle
area, even though it is likely that only a small percentage of the total Army
force will directly participate in combat. Clearly, a soldier's judged
effectiveness in a combat environment represents a potentially important
indicator of overall effectiveness (Sadacca & Campbell, 1985).

This scale, like the Army-wide rating scales, was intended to be
appropriate for any MOS. It is the only criterion that specifically
addresses combat performance for all Project A MOS. It is also the only
instrument expressly designed to measure performance under adverse
conditions.

In developing this rating scale, we recognized that this rating task may
pose some unusual difficulties for raters. First, although raters may often
observe soldiers in garrison/field performance, opportunities to observe
performance under adverse conditions may have been limited. Second, the
majority of peer and supervisor raters have never experienced combat, so they
are being asked to predict how soldiers would perform in a situation that the
raters themselves may not know first-hand.

Unlike the Army-wide rating scales, which are behavioral summary scales,
the Combat Performance Prediction Scale takes the form of a summated scale.
This type of instrument is a series of scaled items (critical incidents),
each followed by a response format. The items represent the positive and the
negative aspects of each behavioral dimension. Items are presented in random
order (across dimensions) on the rating form to preclude a response-set bias
(for either the dimension or the direction of the item).

A major consideration in selecting the summated format for the
prediction scale was the expected high correlations, attributable primarily to
method variance, between this scale and the Army-wide scales if similar
formats had been used for the two types. This was of particular concern
given the subjective nature of the judgments that raters would be asked to
make on the prediction scale. Another consideration was that we felt it was
more reasonable to ask raters how likely it was that the soldiers they were
rating would perform a given act, than to ask them to predict whether or not
these soldiers would actually perform the act at a particular performance
level, under combat conditions. Summing across rating items (acts) yields a
score that measures the rater's assessment of the probability of how the
ratees would act under combat-like conditions.
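
A summated scale of this kind can be sketched as follows. Several details here are assumptions made only for illustration and are not taken from the report: the 1-5 likelihood response format, the reverse-scoring of negatively worded items so that a higher total always indicates better predicted performance, and the names `summated_score` and `presentation_order`.

```python
import random


def summated_score(responses, items, scale_min=1, scale_max=5):
    """Sum likelihood ratings across items, reverse-scoring negative items.

    responses: dict mapping item id -> rating on the assumed response scale.
    items: dict mapping item id -> {"positive": bool}.
    """
    total = 0
    for item_id, rating in responses.items():
        if not scale_min <= rating <= scale_max:
            raise ValueError(f"rating {rating} is outside the response scale")
        if items[item_id]["positive"]:
            total += rating
        else:
            # A high likelihood of a negative act counts as a low score.
            total += scale_min + scale_max - rating
    return total


def presentation_order(items, seed=None):
    """Return the item ids in random order, mixing dimensions and directions."""
    order = list(items)
    random.Random(seed).shuffle(order)
    return order


# Hypothetical three-item form: two positive acts and one negative act.
items = {
    "i1": {"positive": True},
    "i2": {"positive": False},
    "i3": {"positive": True},
}
responses = {"i1": 4, "i2": 2, "i3": 5}
print(summated_score(responses, items))
print(presentation_order(items, seed=1))
```

Shuffling the items once per form, rather than grouping them by dimension or direction, is what guards against the response-set bias described above.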


* This section is based primarily on an unpublished manuscript, "Development
of Combat Performance Prediction Scales," by Barry Riegelhaupt and Robert
Sadacca.


Development of a Conceptual Framework


The  starting  point  for  our  combat  scale  development  work  was  to  build  a 
conceptual  model  of  combat  effectiveness.  We  began  with  a  set  of  behaviors 
that  were  not  directly  related  to  task  performance,  but  were  related  to  the 
broader  concept  of  individual  effectiveness  in  combat.  In  particular, 
elements that would be potentially important contributors to organizational
effectiveness  in  Army  combat  units  were  considered.  From  the  Army's  perspec¬ 
tive,  being  a  good  combat  soldier  means  performing  tasks  in  a  technically 
proficient  manner  and  displaying  such  characteristics  as  motivation,  personal 
discipline,  and  physical  fitness  that  are  valued  Army-wide.  Within  this 
framework,  there  may  be  additional  elements  that  contribute  to  a  soldier's 
combat  effectiveness  in  the  unit.  The  initial  step  of  developing  a  con¬ 
ceptual  framework  was  seen  as  useful  for  guiding  thinking  during  subsequent 
empirical  work  to  identify  and  define  all  elements  of  the  combat  effective¬ 
ness  domain. 


The preliminary set of combat performance dimensions is shown in Figure
III.8. They are the result of preliminary hypotheses about behaviors that
might be important to combat effectiveness. They were developed from a
review of relevant literature (Anderson, 1984a,b,c; Brown & Jacobs, 1970;
Fiedler & Anderson, 1983; Frost, Fiedler, & Anderson, 1983; Henriksen et al.,
1980; Hollander, 1954, 1955; Kern, 1965; Sterling, 1984) and insights
provided by combat veterans on the Project A staff.

While the conceptual framework was considered important to subsequent
development, we also believed strongly that an empirical strategy should be
used to examine the combat effectiveness domain. Accordingly, a variant of
the critical incidents or behavioral analysis (Smith & Kendall, 1963)
approach was employed to identify dimensions of combat effectiveness. The
many behavioral examples emerging from this step were content analyzed, and
then submitted to a retranslation and scaling procedure. Following field
testing, the best items were selected and the scale to be used in Concurrent
Validation was developed.


Critical  Incident  Workshops 

The inductive behavioral analysis strategy (Campbell, Dunnette, Arvey, &
Hellervik, 1973) requires persons familiar with a job's performance demands
to generate examples of effective, mid-range, and ineffective behavior
observed on that job. In the present application, "job behavior" was defined
broadly as any action related to combat effectiveness. Officer and NCO
participants in critical incident workshops were asked to provide behavioral
examples (positive and negative) relevant to first-term combat
effectiveness; examples were to be appropriate for and applicable to any MOS.

Forty-six  officers  and  MCOs  participated  in  one  of  the  four  1-day 
critical  incident  workshops.  All  participants  were  combat  veterans,  the 
large  majority  with  experience  in  Vietnam.  In  each  workshop,  the  leader,  a 
member  of  the  Project  A  research  staff,  first  described  Project  A  and 


A. Esprit de corps

Ability and desire to foster a common spirit of devotion and enthusiasm
among members of a group/unit; identification with group/unit goals;
commitment to maintaining and enhancing the reputation of the unit.

B. Initiative/Flexibility

Ability and willingness to identify and seize the opportunity to create
novel solutions to combat problems; reasoned acceptance of risk.

C. Intelligence/Common Sense

Ability to size up a situation accurately by using all available information;
willingness to evaluate the opinions of experienced personnel before
making decisions.

D. Commitment/Devotion/Responsibility

Willingness to sacrifice personal gain for the good of the unit and its
members; devotion to accomplish one's duty; willingness to take responsibility
for the safety of self and others, for the maintenance of weapons
and equipment, etc.

E. Physical and Moral Courage

Ability to face danger with confidence and emotional stability.

F. Obedience/Allegiance to Superiors

Ability and willingness to obey orders, for example, to advance on enemy
positions, to dig in, etc.

G. Tactical/Technical Knowledge

Ability to follow standard operating procedures; knowledge of and ability
to coordinate weapons, ammunition, equipment, and personnel.

H. Psychological/Physical Effects of Combat

Reaction to stress associated with shooting and killing enemy soldiers,
losing a team/unit leader, seeing others wounded or killed, waiting for
orders between battles, uncertainty of the situation, etc.

I. Interpersonal Communications

Ability to interact with others on a one-to-one or group level.

J. Decisiveness

Ability to make decisions based often on limited, incomplete, and unreliable
information.

K. Personal Example

Ability to set a good personal example for others.


Figure III.8. Preliminary set of combat performance dimensions.
explained how the prediction of combat performance was an integral part of
the project. Participants were then led into a discussion of how combat
effectiveness could be defined in terms of more specific dimensions--that is,
what categories can be used to define what is meant by combat effectiveness?

The workshop leader next presented the preliminary set of dimensions
(see Figure III.8) and discussed overlap, semantic differences, and possible
additions. This approach permitted workshop participants to think about
combat effectiveness from their own perspective, and then compare that with
our notions. Perhaps the most important function served was to establish a
context for the behavioral examples the participants would be writing.

The workshop leader then distributed the instructions on how to write
behavioral examples. These materials had a modeling orientation, showing
participants improperly written examples and then the same examples corrected
to the proper form. After review of these materials, participants were asked to
write a behavioral example, which was reviewed and corrected as needed by the
workshop leaders. Except for periods taken to discuss behavioral examples or
effectiveness dimensions emerging from the content of the examples, the rest
of each workshop was devoted to participants writing and leaders reviewing
the examples.

A total of 361 behavioral examples was generated in the four workshops
(Table III.22). After duplicative examples and those that were specific to
officers, MOS, or equipment were eliminated, 158 usable examples remained.
Since some of these examples might be eliminated during subsequent scale
development work, it was desirable to have a larger set of items available.
A review of a set of examples that had been used in the Army-wide rating
scale retranslation workshops revealed 73 that described behavior in a
combat-type situation; most of them described effective or ineffective
behavior under adverse conditions during training and field exercises. These
examples were added to the 158 usable examples from the combat workshops,
for a total of 231. The distribution is shown in Table III.23.

The examples were edited to a common format and used to revise the preliminary
list of dimensions of combat effectiveness. Three researchers
independently read each example and grouped those that described similar
behaviors. The examples were then reviewed and the groupings revised until
the researchers arrived at a set of homogeneous behavior categories. The
content analysis of the incidents resulted in a reduction of the number of
dimensions from 11 to 8. The revised dimensions are shown in Figure III.9.
Employing the eight dimensions and 231 behavioral examples, materials were
developed for retranslation and scaling workshops.


Retranslation  and  Scaling  Workshop 


Retranslation provides a way of checking on the clarity of individual
behavioral examples and of the dimension system. In retranslation, persons
familiar with the target domain make two judgments about each example: (a)
the dimension or category it belongs to based on its content, and (b) the
level of effectiveness or ineffectiveness it reflects. Examples for which
there is disagreement either on category membership or on effectiveness level
may not be stated clearly, and may need to be revised or eliminated from
further consideration. Also, confusion between two or more content categories
in the sorting of several examples may reflect poorly formed and/or
defined aspects of the dimension system.

The retranslation task was performed by 16 officer and NCO judges, all
of whom were combat veterans. Judges were provided with definitions of each
dimension to aid in sorting behaviors, and a 1-9 effectiveness scale
(1 = extremely ineffective; 5 = average effectiveness; and 9 = extremely
effective) for use in rating the level of positive or negative performance.


Table III.22

Combat Performance Workshop Participants and Examples Generated

                                              Number of
Workshop    Participants                      Examples Generated

1           11 Field Grade Officers                  80
2           10 NCOs                                  32
3           15 Field Grade Officers                 166
4           10 NCOs                                  83

Total       46                                      361


Table III.23

Number of Edited Examples of Combat Behavior

              Combat        Army-Wide
              Workshops     Workshops     Total

Positive         96            42          138
Negative         62            31           93

Total           158            73          231


A. Cohesion/Commitment to Others

• Ability and desire to foster a common spirit of devotion and
enthusiasm among members of a group

• Concern for the physical/emotional welfare of the individual
members of the group

• Commitment to maintaining/enhancing the effectiveness of the group

B. Intelligence/Common Sense

• Ability to learn quickly and apply the newly acquired
knowledge/skill in a novel situation

• Ability to size up a situation and use available resources to make
a decision

• The exercise of appropriate judgment

C. Self-Discipline/Responsibility

• Willingness to accept responsibility for the accomplishment of the
task at hand

• Concern for conditions that jeopardize the safety of self and
others

• Concern for the maintenance of weapons and equipment, etc.

D. Physical/Medical Condition

• Ability and willingness to maintain both physical and medical
fitness

• Physical endurance as demonstrated by little or no reduction in
performance even after or during prolonged or strenuous activities

• Concern for proper health care/hygiene to avoid sickness and
disease

E. Mission Orientation

• Willingness to make sacrifices and endure hardships to accomplish
mission

• Commitment and dedication to accomplishing one's assigned
duties/responsibilities

• Willingness to accept a reasonable amount of risk in the pursuit
of mission accomplishment

F. Technical/Tactical Knowledge

• Ability to follow SOP

• Knowledge of and ability to coordinate weapons, ammunition, and
equipment

• Ability to perform MOS-specific and common soldiering tasks

G. Psychological Effects of Combat

• Reaction to stress associated with shooting and killing, losing a
unit/team leader, seeing others wounded or killed, waiting for
orders between engagements, etc.

• Ability to perform duties with little or no decrement under
emotionally stressful situations

H. Initiative

• Ability and willingness to take the appropriate action at the
appropriate time without being told to do so

Figure III.9. Revised set of combat performance dimensions.
Acceptable agreement was defined as greater than 50% of the judges
sorting an example into the same dimension. Of the 231 examples, 108 did not
meet this criterion and were placed in an "Other" category (Table III.24).
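The agreement rule can be illustrated with a minimal sketch. It assumes each judge's sort is recorded as a dimension label; the function name `retranslate` and the sample sorts are hypothetical and are not taken from the project materials.

```python
from collections import Counter

def retranslate(sorts, threshold=0.5):
    """Return (dimension, agreement) for one behavioral example.

    sorts: list of dimension labels, one per judge.
    The example stays in its most frequently chosen dimension only when
    strictly more than `threshold` of the judges chose it; otherwise it
    is placed in "Other".
    """
    counts = Counter(sorts)
    dimension, votes = counts.most_common(1)[0]
    agreement = votes / len(sorts)
    if agreement > threshold:
        return dimension, agreement
    return "Other", agreement

# Hypothetical sorts from 16 judges for two examples (illustrative data only).
clear_item = ["Initiative"] * 11 + ["Mission Orientation"] * 5
split_item = ["Initiative"] * 8 + ["Mission Orientation"] * 8

print(retranslate(clear_item))  # ('Initiative', 0.6875)
print(retranslate(split_item))  # ('Other', 0.5)
```

Under the "greater than 50%" rule, an example on which the 16 judges split evenly falls into the Other category.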


Table III.24

Agreement^a by Discriminability Item Distribution

                                              t-Value
                                   5.5 &    5.6-     7.1-     9.1 &
Dimension                          Below    7.0      9.0      Above    Total

Cohesion/Commitment                  12       4        9        9        34
Intelligence/Common Sense             1       0        0        3         4
Self-Discipline/Responsibility        3       3        3       16        25
Physical/Medical Condition            2       1        1        2         6
Mission Orientation                   1       6        7        5        19
Technical/Tactical Knowledge          4       3        2        4        13
Psychological Effects                 4       3        2        0         9
Initiative                            1       4        4        4        13
Other^b                              32      24       28       24       108

Total                                60      48       56       67       231
                                   (26%)   (21%)    (25%)    (27%)

^a Greater than 50% agreement among judges in placing items in dimensions.
^b Items not reliably retranslated into the eight dimensions.


The same group of judges performed a scaling task that provides a way of
determining which examples discriminate between "best" and "worst" performers.
Each judge was asked to make two ratings. For one rating, judges were
asked to think of the "best" soldier they had ever worked with in combat and
to decide how likely it was that that soldier would have behaved like the
soldier in each example. For the other rating, they performed the same task,
but this time considered the "worst" soldier they had ever worked with. Half
of the 16 judges rated the "best" soldier first and half rated the "worst"
soldier first, using a 15-point scale ranging from very unlikely (1) to very
likely (15).

A discriminability index was calculated by computing a dependent t-value
for each of the 231 items. The t-value is a measure of the statistical
significance of the difference in mean probability assigned by the raters to
their "best" soldier performing the act described in the item versus that
assigned their "worst" soldier performing the act. A t-value equal to or
greater than 2.95 would be significant at the p < .01 level.
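As an illustration of the index, the following minimal sketch computes a dependent (paired) t statistic for a single item. The ratings are invented, and the function name `dependent_t` is an assumption made for this example; the report does not describe the software used.

```python
import math
import statistics

def dependent_t(best, worst):
    """Paired (dependent) t statistic for one item.

    best[i] and worst[i] are the likelihood ratings that judge i gave the
    "best" and the "worst" soldier for the same item. The differences must
    not all be identical, or the standard deviation is zero.
    """
    diffs = [b - w for b, w in zip(best, worst)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD (n - 1 denominator)
    return mean_diff / (sd_diff / math.sqrt(n))

# Hypothetical ratings from 16 judges on the 1-15 likelihood scale.
best = [14, 13, 15, 12, 14, 13, 15, 14, 12, 13, 14, 15, 13, 14, 12, 15]
worst = [3, 5, 2, 6, 4, 3, 2, 5, 7, 4, 3, 2, 6, 4, 5, 3]

t = dependent_t(best, worst)
CRITICAL_T = 2.95  # value quoted in the text for p < .01
print(round(t, 2), t >= CRITICAL_T)
```

With 16 judges the statistic has 15 degrees of freedom, which is consistent with the 2.95 cutoff quoted above.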


As shown in Table III.24, the 231 items were divided roughly into
quartiles on the basis of t-values, as an aid in selecting items that had
both high discriminability and high agreement. A total of 171 items had
t-values of 5.6 or greater. Thus, a sufficient number of items discriminated
"best" from "worst" combat soldiers at a very high level of statistical
significance. However, 76 of those examples had been assigned to the Other
category after retranslation because they had not been reliably retranslated
into dimensions. Additionally, the categories of Intelligence/Common Sense,
Physical/Medical Condition, and Psychological Effects contained very few
items.
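The cross-classification of items by dimension and t-value band can be sketched as follows. The band boundaries are those of the table columns; the function names and the sample items are hypothetical.

```python
from collections import Counter

def t_band(t):
    """Assign a t-value (rounded to one decimal) to a t-value band."""
    t = round(t, 1)
    if t <= 5.5:
        return "5.5 & below"
    if t <= 7.0:
        return "5.6-7.0"
    if t <= 9.0:
        return "7.1-9.0"
    return "9.1 & above"

def crosstab(items):
    """Count items per (dimension, band). items: iterable of (dimension, t)."""
    return Counter((dim, t_band(t)) for dim, t in items)

# Hypothetical (dimension, t-value) pairs; illustrative data only.
items = [("Initiative", 9.4), ("Initiative", 6.2), ("Other", 5.1), ("Other", 7.1)]
table = crosstab(items)
print(table[("Initiative", "9.1 & above")])  # 1
```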

A factor analysis (unweighted least squares, with a Promax rotation)
was performed to attempt to reduce the number of dimensions. The results
provided guidance on how to combine dimensions. Examples placed in the
Intelligence/Common Sense category were reassigned to either Self-Discipline/
Responsibility or Technical/Tactical Knowledge based upon the frequency of
judges' placement of the items. Items from Physical/Medical Condition were
combined with items in the Self-Discipline/Responsibility category. Behavioral
examples of Psychological Effects were placed in the Mission Orientation
category.

In developing the final form of the Combat Performance Rating Scale, the
goal was to select items that reflected good performance and poor performance
to represent the domain of combat effectiveness. In a summated scale, the
most important criterion is the items' ability to discriminate between
performance extremes. Consequently, relaxing the criterion for dimensional
agreement among judges in order to increase the number of items does
not violate good construction practice for a summated scale. To make sure
that we considered a maximum set of discriminating items, we redefined the
dimension agreement criterion (initially "greater than 50%") to "equal to or
greater than 50%."
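With 16 judges, the practical effect of the redefinition is to admit items on which exactly half of the judges agreed. A minimal sketch of that arithmetic follows; the helper name `min_judges` is hypothetical.

```python
import math

def min_judges(n_judges, inclusive):
    """Fewest judges who must agree for an item to meet the 50% criterion."""
    half = n_judges / 2
    if inclusive:                # "equal to or greater than 50%"
        return math.ceil(half)
    return math.floor(half) + 1  # "greater than 50%"

print(min_judges(16, inclusive=False))  # 9
print(min_judges(16, inclusive=True))   # 8
```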

The agreement by discriminability item distribution for the reduced
dimensional set and redefined agreement criterion is shown in Table III.25.
It should be noted that at this point five items were viewed as too sensitive
and were deleted from further consideration. Following these changes, 113
items had t-values of 5.6 or greater and were reliably retranslated into one
of the five dimensions. This represented the item pool from which the items
were selected for further development of the combat prediction rating scale.
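The item counts quoted above can be checked against the cell counts reported in Table III.25. The sketch below simply re-adds those cells; the dictionary layout is an assumption made for the illustration.

```python
# Rows of Table III.25: item counts per t-value band
# (5.5 & below, 5.6-7.0, 7.1-9.0, 9.1 & above).
table_25 = {
    "Cohesion/Commitment": (14, 4, 9, 10),
    "Self-Discipline/Responsibility": (8, 4, 5, 25),
    "Mission Orientation": (9, 12, 11, 8),
    "Technical/Tactical Knowledge": (8, 6, 5, 3),
    "Initiative": (2, 3, 4, 4),
    "Other": (21, 16, 22, 13),
}

# All items remaining after the five sensitive items were deleted.
total_items = sum(sum(row) for row in table_25.values())

# Items with t >= 5.6 that were retranslated into one of the five dimensions.
pool = sum(sum(row[1:]) for dim, row in table_25.items() if dim != "Other")

print(total_items)  # 226
print(pool)         # 113
```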

Item  Selection 

Selection of items for the field test version of the scale was to be
based primarily on discriminability, with consideration also given to
dimension agreement, and with an approximate balance between positive and
negative examples. Allowing for time constraints in testing, and eliminating
poor items, the goal was to select 80 items--the 16 best discriminating items
from each of the five dimensions. However, when items were rank ordered on
the basis of t-values within each dimension by positive and negative items,
some t-values were too low to allow the item to be included. Also, the
Initiative dimension contained only 13 items. Therefore, in addition to the
five dimensions, items from the Other category were selected for inclusion in
the scale. The items selected for field testing are shown in Table III.26.
Including the Other category resulted in a more balanced coverage of the
dimensions and of the positive/negative split, and a set of items all having
t-values greater than 5.6.
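The rank-ordering step can be sketched as below. This is only an illustration under stated assumptions: the item records, field names, and the function `select_items` are hypothetical, and the sketch does not model the judgmental balancing of positive and negative examples described above.

```python
def select_items(items, per_dimension=16, min_t=5.6):
    """Pick the best-discriminating items within each dimension.

    items: list of dicts with keys "id", "dimension", and "t".
    Items with t-values at or below `min_t` are dropped; the remainder are
    ranked by t-value within each dimension and the top `per_dimension`
    are kept.
    """
    by_dim = {}
    for item in items:
        if item["t"] > min_t:
            by_dim.setdefault(item["dimension"], []).append(item)
    selected = []
    for dim_items in by_dim.values():
        ranked = sorted(dim_items, key=lambda it: it["t"], reverse=True)
        selected.extend(ranked[:per_dimension])
    return selected

# Hypothetical item pool (illustrative data only).
items = [
    {"id": 1, "dimension": "Initiative", "t": 9.2},
    {"id": 2, "dimension": "Initiative", "t": 5.0},
    {"id": 3, "dimension": "Other", "t": 7.4},
    {"id": 4, "dimension": "Initiative", "t": 6.1},
]
print([it["id"] for it in select_items(items, per_dimension=1)])  # [1, 3]
```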


Table III.25

Combat Prediction Agreement^a by Discriminability Item Distribution for
Reduced Dimensional Set and Redefined Agreement Criteria

                                              t-Value
                                   5.5 &    5.6-     7.1-     9.1 &
Dimension                          Below    7.0      9.0      Above    Total

Cohesion/Commitment                  14       4        9       10        37
Self-Discipline/Responsibility        8       4        5       25        42
Mission Orientation                   9      12       11        8        40
Technical/Tactical Knowledge          8       6        5        3        22
Initiative                            2       3        4        4        13
Other^b                              21      16       22       13        72

Total                                62      45       56       63       226
                                   (27%)   (20%)    (25%)    (28%)

^a Equal to or greater than 50% agreement among judges.
^b Items not reliably retranslated into one of the five dimensions.


Table III.26

Items Selected for Field Test of Combat Performance Prediction Scale

Dimension                          Positive    Negative    Total

Cohesion/Commitment                   12           3         15
Self-Discipline/Responsibility         6          10         16
Mission Orientation                    8           7         15
Technical/Tactical Knowledge           4           7         11
Initiative                            10           0         10
Other                                  9           4         13

Total                                 49          31         80


To reduce the administrative burden on any one rater, two forms (Form A
and Form B) were developed. Each contained 60 items--40 common to both forms
and 20 unique to each form.

Review  and  Rescaling 

The two proposed 60-item forms of the Combat Performance Prediction
Scale were reviewed by three company grade Army officers and three ARI
scientists. As a result of that review, three items common to both forms
were deleted and a large proportion of the remaining 77 items were reworded.
The rewording was extensive enough to render questionable the discriminability
indexes previously computed for each item. Therefore, the 77 items
were subjected to a rescaling.

The rescaling workshop was conducted using the same procedures as for
the original scaling. Eight officers and one civilian (seven of the nine
were combat veterans) made the "best" and "worst" combat soldier ratings for
each of the 77 items. New t-values were computed for each item. In general,
the rescaled values were of lower statistical significance than the original
t-values. However, in only one case did the rescaled item result in a
nonsignificant t-value. This item (in Form A) was deleted.


Field  Test  Version  of  Combat  Effectiveness  Prediction  Scale 


For the field test, Part I of Form A contained 56 items and Form B
contained 57 items. In Part II of both forms, raters were asked to respond
to three additional questions: how confident they were, overall, in the
ratings they had just completed; how many of the items made sense to them;
and which items least applied to the soldiers whom they had just rated.

The field test version of the Combat Effectiveness Prediction Scale thus
consisted of 76 items split between two forms, with 3 additional items
designed to capture reactions to the scale itself. The instructions for
raters and two sample items from the field version are shown in Figure
III.10. Because the development of this scale followed a different schedule
than the other criterion measures, it was field tested with only a portion of
the Batch B sample. The field test results are reported in Section 14 of
this report.
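The item counts for the two forms follow from the deletions described above. The short tally below assumes that the one item dropped after rescaling was unique to Form A, which is an inference from the reported form lengths rather than something the text states outright.

```python
# Initial composition of the two 60-item forms.
common, unique_a, unique_b = 40, 20, 20
assert common + unique_a + unique_b == 80

common -= 3    # three common items deleted after the officer/ARI review
unique_a -= 1  # one Form A item had a nonsignificant rescaled t-value

form_a = common + unique_a
form_b = common + unique_b
total = common + unique_a + unique_b
print(form_a, form_b, total)  # 56 57 76
```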
COMBAT PERFORMANCE PREDICTION SCALE


INSTRUCTIONS


On the following pages, you will find examples that describe activities of soldiers in combat. Assume that
the soldiers you are rating were placed in the combat situation described and had the opportunity to behave
as the soldier in each example behaved. Then, using the scale shown below, indicate the likelihood of each
soldier you are rating performing as the soldier in the example performed.


Very        Fairly      About 50-50     Fairly      Very
Unlikely    Unlikely      Chance        Likely      Likely

o  o  o  o  o  o  o  o  o  o  o  o  o  o  o


Please darken the circle under the point on the scale that gives the likelihood that each soldier would behave
in the way described in the example. For example, if you think that there was absolutely no chance that
the soldier you are rating would do what the soldier in the example did, then you would darken the first
circle under the Very Unlikely part of the scale. If you think that the soldier you are rating would absolutely
certainly do what the soldier in the example did, then darken the last circle under the Very Likely part of
the scale. If the likelihood is between the two extremes, darken the appropriate circle.


Please evaluate each soldier's likelihood of doing every activity. Do not leave any blanks.


COMBAT PERFORMANCE PREDICTION
RATING SCALE ITEMS

1. This soldier volunteered to lead a team to an accident scene where immediate first aid was required
before an order was given.


[Response grid: one row of 15 circles per soldier rated, under the headings Very Unlikely, Fairly Unlikely,
About 50-50 Chance, Fairly Likely, and Very Likely. Margin note: "Line up the names of the soldiers with
the rows to the right."]

2. Near the end of a movement, when soldiers were ordered to prepare fighting positions, this soldier
prepared his position quickly and then assisted other squad members.

Figure III.10. Sample of Combat Performance Prediction Scale
Section 7

ADMINISTRATIVE/ARCHIVAL  RECORDS  AS  ARMY-WIDE  PERFORMANCE  MEASURES1 


A major activity within the overall program of performance criterion
development is to explore the use of the archival administration records as
first-tour job performance criteria and in-service predictors of soldier
effectiveness. The Enlisted Master File (EMF), the Official Military
Personnel File (OMPF), and the Military Personnel Records Jacket (MPRJ) are
the Army records sources that contain administrative actions that could be
used to form measures of first-tour soldier effectiveness.

A serious difficulty in using administrative records for evaluation
purposes is that the material in the records very often reflects only
exceptionally good or exceptionally poor performance. Measures of
performance based on personnel actions that appear infrequently could have
very little variance. A strategy for dealing with the skewness and lack of
variability in records data that result from low base rates is to combine
records of different kinds of events and actions into more general indexes.
When scores on administrative measures that reflect the same underlying
constructs are combined, the base rate might improve to a level where
significantly higher correlations with other variables would be possible.
Accordingly, project staff undertook a detailed examination of the three
archival data sources and an analysis of the feasibility of developing
first-tour and in-service predictors from them.
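The effect of combining low-base-rate records into a more general index can be illustrated with a small sketch. The records, the choice of components, and the function names are all hypothetical; the sketch assumes each record is coded 0/1 per soldier.

```python
def base_rate(flags):
    """Proportion of soldiers with at least one recorded event."""
    return sum(1 for f in flags if f) / len(flags)

def combine(*indicators):
    """Composite flag: 1 if any component event is on record for a soldier."""
    return [int(any(events)) for events in zip(*indicators)]

# Hypothetical 0/1 records for ten soldiers (illustrative data only).
article_15 = [0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
reprimand = [0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
awol = [0, 0, 1, 0, 0, 0, 0, 0, 1, 0]

disciplinary = combine(article_15, reprimand, awol)
print(base_rate(article_15), base_rate(disciplinary))  # 0.1 0.3
```

In this invented example the composite index has three times the base rate of any single component, which leaves more variance available for correlation with other variables.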


Identification  of  Administrative  Indexes 

A preliminary list of administrative measures indicative of soldier
effectiveness was developed from a review of relevant Army Regulations,
previous research efforts in military settings, and interviews with
knowledgeable Army personnel. The list is presented in Table III.27. A
description of the detailed investigation into each of the three records
sources follows.


Enlisted  Master  File  (EMF) 

The EMF is an automated inventory of personal data, enlistment conditions,
and military experience for every enlisted individual currently on the
U.S. Army payroll. It contains a large number of variables for each individual,
ranging from pay grade to Skill Qualification Test (SQT) scores to the
Army's operational performance appraisal ratings in the form of the Enlisted
Efficiency Report (EER).


1 This section is based primarily on an ARI Technical Report 754, The
Development of Administrative Measures as Indicators of Soldier Effectiveness,
by Barry J. Riegelhaupt, Carolyn DeMeyer Harris, and Robert


Table III.27


Preliminary List of Administrative Measures Indicative of
Soldier Effectiveness


• Reason for Separation From the Army

• Reenlistment Eligibility

• Reenlistment Eligibility Bar

• Enlisted Evaluation Report (EER)

• Promotion Rate

• Number and Duration of AWOL/Desertions

• Number and Type of Articles 15

• Number and Type of Courts-Martial

• Number and Type of Awards/Badges

• Number and Type of Letters of Appreciation/Commendation

• Number and Type of Letters of Reprimand/Admonition

• Number and Type of Certificates of Achievement/Commendation

• Number and Type of Civilian Courses Attended/Completed

• Number and Type of Service Courses Attended/Completed

• Performance in Service Courses


An initial examination of the EMF identified four variables as potentially
useful: (a) reason for separation, (b) reenlistment eligibility,
(c) reenlistment eligibility bar, and (d) EER score.

In theory, the EER, which is a weighted average of a soldier's last
five performance ratings, should be a very useful variable. As a practical
matter, however, for Project A purposes its value may be limited. EER ratings
are obtained only for soldiers in grades E5 and above, so not more than
a small percentage of first-tour enlisted personnel is likely to have had
even one EER at the time of the Project A data collection. Also, for
understandable reasons, EER ratings have tended to cluster near the maximum score.

Information relevant to two additional variables is available from the
EMF. First, it is possible to compute a promotion rate, defined as grades
advanced per year, for each soldier. Second, while neither the number of
times an individual has been AWOL nor the duration of each AWOL is available
from the EMF, it is possible to assign soldiers to the dichotomous variable,
"Has or Has Never Been AWOL."

Information on awards, badges, letters and certificates of appreciation,
achievement, and commendation, Articles 15, and so forth is not contained on
the EMF. Information of this type exists only in the individual soldier's
Official Military Personnel File (OMPF) or Military Personnel Records Jacket
(MPRJ).
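The two derived EMF variables described above, promotion rate and the AWOL dichotomy, can be sketched as follows. The argument names and the numeric coding of pay grades (E1 = 1, E4 = 4) are assumptions made for the illustration, not EMF field definitions.

```python
def promotion_rate(entry_grade, current_grade, years_of_service):
    """Grades advanced per year of service (e.g., E1 to E4 is 3 grades)."""
    if years_of_service <= 0:
        raise ValueError("years_of_service must be positive")
    return (current_grade - entry_grade) / years_of_service

def ever_awol(awol_indicator):
    """Dichotomous "Has or Has Never Been AWOL" variable (1 = has been)."""
    return 1 if awol_indicator else 0

# A hypothetical soldier who advanced from E1 to E4 in two years.
print(promotion_rate(1, 4, 2.0))  # 1.5
print(ever_awol(False))           # 0
```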
Official Military Personnel File (OMPF)

The OMPF is the permanent, historical, and official record of a member's
military service. The information for enlisted personnel is maintained on
microfiche records located at the Enlisted Records and Evaluation Center
(EREC), Fort Benjamin Harrison, Indiana.

Depending upon their purpose, documents are filed in one of three
sections:

• The performance (P) fiche - the portion of the OMPF where
performance, commendatory, and disciplinary data are filed.

• The service (S) fiche - the OMPF section where general information
and service data are filed.

• The restricted (R) fiche - the OMPF section for historical data that
may be biased against the soldier when viewed by selection boards or
career managers. For this reason release of information on this
fiche is controlled.

The  usefulness  of  the  microfiche  records  for  project  purposes  was  examined 
systematically  via  a  pilot  study. 

Sample Selection. A random sample of 25 enlisted personnel from each of
the 19 MOS being studied in Project A was selected from the FY82 Enlisted
Master File. The list of 475 names and corresponding social security numbers
(SSN) was forwarded to the Enlisted Records and Evaluation Center, with a
request to have the 475 records available for a project data collection team
that would examine them on site.

Data Collection and Analysis. A data collection form for recording the
administrative measures listed in Table III.27 was developed. Upon arrival
at Fort Benjamin Harrison, the data collection team was handed 414 microfiche
packets. This represented 89% of the 466 packets that EREC personnel
attempted to locate (nine names had been omitted in the transmission of the
request). Each of the microfiche records in the packets was examined by a
staff member and information was entered on the records collection form.

After examining the microfiche and the regulations governing their
composition, as well as interviewing knowledgeable officials, the team
reached a number of conclusions, which are expressed below in terms of
optimal and actual outcomes:


Optimal  Outcomes  - 

(1)  Performance  data  for  475  soldiers  would  be  available. 

(2)  All  475  soldiers  would  be  new,  first-time  soldiers  in  FY81. 

(3)  No  Enlisted  Evaluation  Reports  (EER)  would  be  found. 

(4)  All  authorized  documents  would  appear  on  microfiche. 

(5)  Recorded  information  would  be  timely. 


Actual  Outcomes  - 


(1) Performance data were available for only 136 soldiers--29% of the
projected sample.

(2) Of the 136 soldiers for whom performance information was available,
44 (32%) were prior service members.

(3) Since it had been assumed that the sample was comprised of new,
first-term soldiers, individuals would not have been in the Army
long enough to have had an EER. However, 26 EERs were found among
the records for 20 soldiers, all of whom were prior service
members.

(4) While many documents are authorized to appear in the OMPF performance
section, a recent change to Army Regulation 640-10 requires
written filing instructions if certain documents are to be
entered. For example, a letter of commendation will not routinely
be forwarded for filming; it will be sent to EREC only if it is
specifically directed to the OMPF.

Thus, it is possible for soldiers to have a number of documents in
their Military Personnel Records Jacket that are authorized to
appear on OMPF microfiche, but that may not be there because they
were not directed to the OMPF.

(5) For grades below E5 (the grade levels of enlisted personnel in the
first major Project A data collection), there is a backlog of 8 to
12 months from the time a personnel action is taken until it
appears on microfiche at EREC. The primary reason for this backlog
is that, for the grades E5 and above, microfiche are used by
central promotion boards. Documents submitted for filming for
these individuals take precedence over documents received for
soldiers below the grade of E5.
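The retrieval figures reported for the microfiche pilot study can be reproduced with simple arithmetic. The sketch below uses only counts given in the text; the variable names are chosen for the illustration.

```python
requested = 25 * 19       # 25 soldiers from each of 19 MOS
searched = requested - 9  # nine names were omitted in transmission
located = 414             # microfiche packets handed to the team
with_performance_data = 136
prior_service = 44

print(requested, searched)                                 # 475 466
print(round(100 * located / searched))                     # 89
print(round(100 * with_performance_data / requested))      # 29
print(round(100 * prior_service / with_performance_data))  # 32
```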


Because  of  these  aspects  of  the  microfiche  records,  the  next  step  was  to 
determine  the  feasibility  of  developing  criterion  indexes  from  the  Military 
Personnel  Records  Jacket,  known  as  the  201  File. 


Military Personnel Records Jacket (MPRJ)


The MPRJ, or 201 File, is the primary mechanism for storing information
about an individual's service record. Updates/additions/corrections to the
file are made at the time of the action. The MPRJ physically follows the
individual wherever he or she goes and is normally located at the Military
Personnel Office (MILPO) that serves the soldier's unit.

The feasibility of using data from the 201 File for Project A evaluations
was examined in much the same way as for the microfiche records. To
develop a data collection form that could be used to record information from
201 Files, detailed reviews of relevant Army Regulations and interviews with
knowledgeable Army personnel were conducted. An expanded list of potential
indexes was compiled (Table III.28) and a records collection form was
developed for use in a pilot study.
Table III.28

Expanded List of Administrative Measures Indicative of
Soldier Effectiveness

• Comparison of Skill Level of Primary to Duty MOS
• Existence of Secondary MOS
• Existence of Skills Qualification Identifier (SQI)
• Existence of Additional Skill Area (ASI)
• Existence of Language Identifier (LI)
• Record of Skill Qualification Test (SQT) Score Within Past 12 Months
• Type of Reenlistment Eligibility
• Type of Military Education Leadership Course
• Level of Highest Civilian Education
• Promotion Rate
• Existence of Promotion Packet at E4
• Number and Type of Awards/Badges
• Record of Requalification Weapons Score Within Past 12 Months
• Number and Type of Certificates of Achievement/Appreciation/Commendation
• Number and Type of Letters of Appreciation/Commendation
• Number and Type of Letters of Reprimand/Admonition
• Number of Additional Military Training Courses Completed
• Number and Type of Correspondence Courses Completed
• Number of Additional Civilian Education Classes Completed
• Course Summary and Abilities Ratings - Service School
• Professional Competence and Standards Ratings and Summary Score of
  Enlisted Efficiency Report
• Type, Sentence, Suspension, Vacation of Court-Martial
• Existence of Court-Martial Proceedings in Action Pending
• Reason for Bar to Reenlistment
• Number and Duration of AWOL
• Number of Violations and Reason for Article 15
• Reason for FLAG Action
• Number of and Reason for Disposition - Block to Promotion


Sample Selection. The plan was to collect records data from the MPRJ
for a sample of 750 soldiers, 150 in each of five MOS at five Army posts. To
achieve this sample size while allowing for unavailability of some records,
the records of 200 soldiers at each post were requested.

To increase the likelihood that findings from the records collection
could be generalized, MOS choice was based on diversity. Each MOS
represented a different Career Management Field (CMF), a different ASVAB area
composite, and a different cluster, where "clusters" refers to the job
groupings derived from the Project A MOS clustering (Rosse, Borman, Campbell,
& Osborn, 1983). The selected MOS and the corresponding incumbent
populations are shown in Table III.29.
Table III.29

MOS x Post Populations in Study of Military Personnel Records Jackets

                          MOSᵃ
Post      05C     11B     64C     71L     91B     Total
A          42     149     111     108      98       508
B         182     505     199     252     207     1,345
C         125     193     198     226     165       907
D          53     359     112      91      73       688
E          56     196     134      74      82       542
Total     458   1,402     754     751     625     3,990

ᵃMOS:
   05C  Radio TT Operator
   11B  Infantryman
   64C  Motor Transport Operator
   71L  Administrative Specialist
   91B  Medical Care Specialist


Data Collection Procedure. Data were collected by teams of two research
staff members in 2-day visits to each of five posts. Table III.30 indicates
the number of MPRJs from which data were collected at each post.


Table III.30

Number of Military Personnel Records Jackets Requested and
Received at Each Post

              Number of MPRJ
                                     Percent
Post      Requested    Received     Received
             200          133           67
             200          153           77
             200          156           78
             200          159           80
             200          146           73
Total      1,000          747           75

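The percentages in Table III.30 follow directly from the counts. A minimal sketch is given here, assuming that 200 records were requested at every post (as stated in the text) and that percentages were rounded half up; the function and variable names are illustrative and do not come from the report.

```python
# Counts received at the five posts, in the order listed in Table III.30.
REQUESTED_PER_POST = 200
received = [133, 153, 156, 159, 146]

def percent_half_up(numerator, denominator):
    """Integer percentage rounded half up, using integer arithmetic only."""
    return (200 * numerator + denominator) // (2 * denominator)

# Per-post receipt rates, then the pooled rate across all posts.
post_percents = [percent_half_up(n, REQUESTED_PER_POST) for n in received]
total_received = sum(received)
total_requested = REQUESTED_PER_POST * len(received)
total_percent = percent_half_up(total_received, total_requested)

print(post_percents, total_received, total_percent)
```

Python's built-in `round` rounds halves to the nearest even integer, which would turn 66.5 into 66 rather than the 67 shown in the table, so the sketch avoids it.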

Frequency distributions were generated for each data field. Based upon
these frequencies, a set of 38 variables was created (Table III.31).
Variables 11-13 and 21-24 were created based upon the model of soldier
effectiveness dimensions that had been previously developed (Borman et al.,
1985a). This research identified the following performance dimensions as
relevant to all soldiers, regardless of their MOS:

A.  Controlling  own  behavior  related  to  personal  finances, 
drugs/alcohol,  and  aggressive  acts 

B.  Adhering  to  regulations,  orders,  and  SOP  and  displaying 
respect  for  authority 

C.  Displaying  honesty  and  integrity 

D.  Maintaining  proper  military  appearance 

E.  Maintaining  proper  physical  fitness 

F.  Maintaining  own  equipment 

G.  Maintaining  living  and  work  areas  to  Army/unit  standards 

H.  Exhibiting  technical  knowledge  and  skill 

I. Showing initiative and extra effort on the job/mission/assignment

J.  Attending  to  detail  on  jobs/assignments/equipment  checks 

K.  Developing  own  job  and  soldiering  skills 

L. Effectively leading and providing instruction
to other soldiers

M.  Supporting  other  unit  members. 


Specifically, in addition to counting the number of Articles 15 that a
soldier received, for example, we recorded the reason for each disciplinary
action and mapped these reasons onto the model's dimensions. This allowed
for the creation of variables based on the content of administrative actions
as well as on a count of those actions, which was consistent with the Project
A construct validation approach.
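The mapping step can be sketched as follows. This is an illustration only: the reason strings and the lookup table are hypothetical, and the report does not give the coding rules that were actually applied. Only the construct letters (A through M) and the idea of counting citations by construct come from the text.

```python
from collections import Counter

# Hypothetical lookup from a recorded Article 15 reason to a construct letter.
# The reason strings are invented for illustration.
REASON_TO_CONSTRUCT = {
    "drunk on duty": "A",           # controlling own behavior
    "failure to obey order": "B",   # adhering to regulations, orders
    "disrespect to NCO": "B",       # respect for authority
    "false statement": "C",         # honesty and integrity
}

def derive_variables(reasons):
    """Build count-based and content-based variables for one soldier."""
    counts = Counter(REASON_TO_CONSTRUCT.get(reason) for reason in reasons)
    cited_a = counts["A"]
    cited_b = counts["B"]
    # Citations mapped to any construct other than A or B; unmapped reasons
    # (key None) are excluded.
    cited_other = sum(n for c, n in counts.items() if c not in ("A", "B", None))
    return {
        "n_actions": len(reasons),                           # simple count
        "cited_A": cited_a,
        "cited_B": cited_b,
        "cited_other": cited_other,
        "total_cited": cited_a + cited_b + cited_other,      # sum of the three
    }

example = derive_variables(
    ["drunk on duty", "failure to obey order", "disrespect to NCO", "false statement"]
)
print(example)
```

Keeping the count of actions separate from the construct-level counts preserves both kinds of variable described above.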

The original request for 1,000 MPRJs specified a Basic Active Service
Date (BASD) window of 17 months. At this point, the 17-month window was
reduced to 13 months to reflect more accurately the time that soldiers in the
actual FY83/84 cohort first-tour data collection would be in the service.
Only those soldiers who entered the Army between 1 July 1981 and 31 July 1982
at an initial grade of PFC or less were retained. The result was a sample of
650 soldiers in the 11B, 05C, 64C, 71L, or 91B MOS who had been in the Army
between 14 and 27 months.
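The retention rule can be expressed as a short filter. The sketch below assumes a reference date of 15 October 1983 for computing time in service (the report states only that MPRJ data were collected in the second and third weeks of October 1983); the function names and the grade labels used as set members are illustrative.

```python
from datetime import date

WINDOW_START = date(1981, 7, 1)
WINDOW_END = date(1982, 7, 31)
ELIGIBLE_INITIAL_GRADES = {"PV1", "PV2", "PFC"}   # PFC or less
ASSUMED_COLLECTION_DATE = date(1983, 10, 15)      # assumption, see lead-in

def in_sample(basd, initial_grade):
    """True if the soldier entered within the 13-month BASD window at PFC or below."""
    return WINDOW_START <= basd <= WINDOW_END and initial_grade in ELIGIBLE_INITIAL_GRADES

def months_of_service(basd, as_of=ASSUMED_COLLECTION_DATE):
    """Whole months elapsed between the BASD and the reference date."""
    months = (as_of.year - basd.year) * 12 + (as_of.month - basd.month)
    if as_of.day < basd.day:
        months -= 1
    return months

print(months_of_service(WINDOW_START), months_of_service(WINDOW_END))
```

Under that assumed reference date, the earliest and latest admissible entry dates give 27 and 14 whole months of service, which agrees with the range reported above.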


Table III.31

List of Created Variables in Study of Administrative Measures

Variable
Number    Variable

01        Has SQI, ASI, or Language Identifier
02*       Is working at skill level DMOS higher/lower than PMOS
03        Is eligible to reenlist
04*       Highest grade attained
05*       Current grade
06        Never demoted
07        Number of awards
08        M16 rating
09        Has EXP grenade rating
10        Number of letters/certificates
11        Cited for exhibiting technical knowledge and
          skill (Constructs H and J)ᵃ
12        Cited for physical and mental self development
          (Constructs E and K)ᵃ
13        Cited for constructs other than E, H, J, and Kᵃ
14*       Has had special military education
15        Number of military training courses
16*       Years of civilian education
17*       Has high school diploma
18*       Has earned civilian education credits
19        Number of Articles 15/FLAG actions
20        Has been AWOL
21        Cited for failure to adhere to rules and regulations
          and disrespect for authority (Construct B)ᵃ
22        Cited for failure to control own behavior
          (Construct A)ᵃ
23        Cited for construct violations other than
          Constructs A and Bᵃ
24        Number of times cited for construct violations
          (Variables 21 + 22 + 23)ᵃ
25        Number of times assigned extra duty
26        Has had punishment suspended
27        Has forfeited pay
28        Has been restricted
29        Has been confined
30        Initial grade
31        Change in grade (Variables 05, 30)
32        Time period in years between first and last
          grade change
33        Promotion rate (number of grades advanced
          per year -- Variables 31/32)
34        Has received punishment
35        Has received Army Achievement Medal (AAM)
36        Has received air assault badge
37        Has received parachute badge
38        Has received other award

* Indicates an interim variable used only to define the actual variable.
The interim variable was not used in subsequent analyses.

ᵃ See construct list in text. Construct definitions appear in Borman
et al. (1987).

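The derived promotion variables can be illustrated with a short sketch. It assumes that Change in Grade is current grade minus initial grade on the E1-E6 pay-grade scale, and that Promotion Rate is that change divided by the years between the first and last grade change; the report lists the component variables but does not spell out the computation, so the functions below are an interpretation.

```python
# Pay-grade level for each enlisted grade label used in the tables.
GRADE_LEVEL = {"PV1": 1, "PV2": 2, "PFC": 3, "SP4/CPL": 4, "SP5/SGT": 5, "SP6/SSG": 6}

def change_in_grade(initial_grade, current_grade):
    """Grades gained (negative if demoted) between initial and current grade."""
    return GRADE_LEVEL[current_grade] - GRADE_LEVEL[initial_grade]

def promotion_rate(initial_grade, current_grade, years_between_changes):
    """Grades advanced per year; None when no time interval is available."""
    if years_between_changes <= 0:
        return None
    return change_in_grade(initial_grade, current_grade) / years_between_changes

print(promotion_rate("PV1", "PFC", 1.0))       # two grades in one year
print(change_in_grade("PFC", "PV2"))           # a one-grade demotion
```

Returning `None` for a zero interval keeps soldiers with no recorded grade change from producing a division error; how such cases were handled in the study is not stated.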
Military Personnel Records Jacket (MPRJ) - Official Military Personnel
File (OMPF) Comparison. Using the records collection form developed to
extract records data from the MPRJ, three research staff members spent 2 days
collecting records data from the OMPFs of 292 soldiers. The 292 individuals
represented a random sample of the 650 soldiers from whose MPRJs
administrative records data had previously been collected. Thus, the amount
of information available from the two records sources could be compared.

The frequency distributions of selected administrative variables
available from the MPRJ and the OMPF are compared in Table III.32. As can be
seen, the MPRJ was found to be a much richer source of information on the
administrative actions of interest in Project A. In the extreme case,
information relevant to a soldier's reenlistment eligibility was not even
available from the OMPF.

Military Personnel Records Jacket (MPRJ) - Enlisted Master File (EMF)
Comparison. Presented in Table III.33 are frequency distributions of
selected variables collected from the MPRJ that are also available from the
EMF. As can be seen, unlike the MPRJ-OMPF comparison, a rather high degree
of correspondence exists between the MPRJ and the EMF. It should be noted
that the EMF was an FY83 end-of-year tape. The MPRJ data were collected
during the second and third weeks in October 1983. Thus, the MPRJ
information was being compared to EMF entries that were, at most, 3 weeks
behind the information in the field. Even in light of the 3-week difference,
the correspondence between sources is impressive and highlights the benefits
of having current EMF information available.


Results of Analysis of MPRJ Data

Analyses were conducted in two stages:

(1) Identification of administrative variables potentially useful in
Project A measures

(2) Examination of the relationships of the identified variables with
selected nonadministrative variables (e.g., Post, MOS, Moral
Waiver)


Variable  Selection 

A first step in determining the usefulness for Project A purposes of the
administrative variables collected from MPRJs (201 Files) was to select those
measures with an acceptable amount of variance. The frequency distributions
for each administrative measure are presented in Table III.34. The product
moment correlations among the administrative variables are presented in Table
III.35.


Table III.32

Frequency Distributions for Selected Variables in MPRJ-OMPF Comparison
(N = 292 soldiers)

                                                     MPRJ          OMPF
Variable                              Category    (201 File)   (Microfiche)

Number of Letters/Certificates        0               218           287
                                      1                45             4
                                      2 or More        29             1

Number of Awards                      0               209           262
                                      1                69            27
                                      2 or More        14             3

Has Received Article 15               No              258           278
                                      Yes              34            14

Has Been AWOL                         No              286           290
                                      Yes               6             2

Has Had Special Military Education    No              270           288
                                      Yes              22             4

Is Eligible to Reenlist               Blank            41           292
                                      No               29            --
                                      Yes             222            --

Highest Grade Attained                PV1               1           237
                                      PV2              13            20
                                      PFC             156            17
                                      SP4/CPL         116            18
                                      SP5/SGT           1            --
                                      SP6/SSG           5            --

Change in Grade                       -1                1            --
                                      0                19           278
                                      1                56             3
                                      2               135             2
                                      3                77             9
                                      4                 2            --
                                      5                 2            --


Table III.33

Frequency Distributions for Selected Variables in MPRJ/EMF Comparison
(N = 650 soldiers)

                                                     MPRJ          EMF
Variable                              Category    (201 File)   (FY83 End)

Has Been AWOL                         No              631           633
                                      Yes              19            17

Has Had Special Military Education    No              620           623
                                      Yes              30            27

Is Eligible to Reenlist               Blank            76            71
                                      No               57            52
                                      Yes             517           527

Initial Grade                         Blank             1             2
                                      PV1             497           516
                                      PV2              76            68
                                      PFC              76            64

Current Grade                         PV1              13             7
                                      PV2              32            14
                                      PFC             309           341
                                      SP4/CPL         290           282
                                      SP5/SGT           6             6

Promotion Rate                        0                40            41
                                      1               136           112
                                      2               375           401
                                      3                98            96
                                      4                 1             0

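One way to put a number on the correspondence visible in Table III.33 is to compute the share of cases that fall in matching categories of the two frequency distributions. The sketch below does this for three of the variables. The overlap index is an illustration chosen here, not a statistic reported in the study, and it describes agreement between marginal distributions only, not record-by-record agreement.

```python
# Frequencies transcribed from Table III.33 (N = 650 soldiers per source).
MPRJ = {
    "Has Been AWOL": [631, 19],
    "Is Eligible to Reenlist": [76, 57, 517],
    "Current Grade": [13, 32, 309, 290, 6],
}
EMF = {
    "Has Been AWOL": [633, 17],
    "Is Eligible to Reenlist": [71, 52, 527],
    "Current Grade": [7, 14, 341, 282, 6],
}

def distribution_overlap(freq_a, freq_b):
    """Overlap of two frequency distributions; 1.0 means identical category counts."""
    if sum(freq_a) != sum(freq_b):
        raise ValueError("distributions must describe the same number of cases")
    return sum(min(a, b) for a, b in zip(freq_a, freq_b)) / sum(freq_a)

overlap = {name: distribution_overlap(MPRJ[name], EMF[name]) for name in MPRJ}
for name, value in overlap.items():
    print(f"{name}: {value:.3f}")
```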

Based upon the information presented in Tables III.34 and III.35, and
the regulations governing reenlistment and promotion criteria, six variables
were selected as potentially useful criteria and in-service predictors for
Project A. The six measures were:

• Eligible to Reenlist

• Number of Letters/Certificates

• Number of Awards
Table III.34

Frequency and Percentage Distributions for Administrative Variables

Variable
Number   Variable                                Category     Frequency  Percent

01       Has SQI/ASI/LI                          No              518       79.7
                                                 Yes             132       20.3

03       Is Eligible to Reenlist                 Blank            76        --
                                                 No               57        9.9
                                                 Yes             517       90.1

06       Never Demoted                           No               25        3.9
                                                 Yes             625       96.1

07       Number of Awards                        0               436       67.1
                                                 1               169       26.0
                                                 2 or more        45        6.9

08       M16 Rating                              Blank            37        --
                                                 MKM             290       47.3
                                                 SPS             183       29.9
                                                 EXP             140       22.8

09       Has EXP Grenade Rating                  No              490       75.3
                                                 Yes             160       24.6

10       Number of Letters/Certificates          0               461       70.9
                                                 1               113       17.4
                                                 2 or more        76       11.7

11       Cited for Technical Knowledge           0               525       80.8
         and Skill (Constructs H and J)          1                83       12.8
                                                 2 or more        42        6.5

12       Cited for Physical and Mental           0               609       93.7
         Self-Development                        1 or more        41        6.3
         (Constructs E and K)

13       Cited for Constructs Other Than         0               582       89.5
         E, H, J, and K                          1 or more        68       10.5

15       Number of Military Training Courses     0               484       74.5
                                                 1               128       19.7
                                                 2 or more        38        5.9

19       Has Received Article 15/FLAG Action     No              576       88.6
                                                 Yes              74       11.4

20       Has Been AWOL                           No              631       97.1
                                                 Yes              19        2.9

22       Cited for Failure to Control            0               620       95.4
         Own Behavior (Construct A)              1 or more        30        4.6

23       Cited for Construct Violations          0               625       96.1
         other than A and B                      1 or more        25        3.9

24       Number of Times Cited for               0               554       85.2
         Construct Violations                    1                61        9.4
                                                 2 or more        35        5.4

25       Has Received Extra Duty                 No              595       91.5
                                                 Yes              55        8.5

26       Has Had Punishment Suspended            No              611       94.0
                                                 Yes              39        6.0

27       Has Forfeited Pay                       No              583       89.7
                                                 Yes              67       10.3

28       Has Been Restricted                     No              610       93.9
                                                 Yes              40        6.1

29       Has Been Confined                       No              638       98.1
                                                 Yes              12        1.9

33       Promotion Rate                          0                40        6.1
         (Grades Advanced/Year)                  1               136       20.9
                                                 2               375       57.7
                                                 3                98       15.1
                                                 4                 1         .1

34       Has Received Punishment                 No              574       88.3
                                                 Yes              76       11.7

35       Has Received AAM                        No              582       89.5
                                                 Yes              68       10.5

36       Has Received Air Assault Badge          No              618       95.1
                                                 Yes              32        4.9

37       Has Received Parachute Badge            No              559       86.0
                                                 Yes              91       14.0

38       Has Received Other Award                No              584       89.9
                                                 Yes              66       10.1

• Number of Military Training Courses

• Has Received Article 15/FLAG Action

• Promotion Rate (Grades Advanced/Year)

Based on the frequency distributions shown in Table III.34, the Number of
Letters/Certificates, Number of Awards, and Number of Military Training
Courses variables were transformed into dichotomous variables--Has Received
Letters/Certificates, Has Received Awards, and Has Had Military Training
Courses.
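The transformation amounts to collapsing every nonzero count category into a single "Yes" category. A minimal sketch follows, using the frequencies reported in Table III.34 for two of the three variables; the function name and the dictionary layout are illustrative.

```python
# Frequencies for two count variables, transcribed from Table III.34 (N = 650).
LETTERS_CERTIFICATES = {"0": 461, "1": 113, "2 or more": 76}
MILITARY_TRAINING_COURSES = {"0": 484, "1": 128, "2 or more": 38}

def dichotomize(freq):
    """Collapse a count distribution into No (zero) versus Yes (one or more)."""
    yes = sum(n for category, n in freq.items() if category != "0")
    return {"No": freq["0"], "Yes": yes}

has_letters = dichotomize(LETTERS_CERTIFICATES)
has_courses = dichotomize(MILITARY_TRAINING_COURSES)
print(has_letters, has_courses)
```

With so few soldiers in the "2 or more" categories, the dichotomous form loses little information while giving each category a workable number of cases.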


Relationships  of  Administrative  Measures  With  Other  Variables 

Each of the six administrative measures and a combined "Has Received
Letter/Certificate/Award" variable were subjected to a series of analyses.
These included an examination of MOS and Post differences; stepwise multiple
regressions, in which AFQT, Moral Waiver, Sex, and Race were entered after
controlling for Post and MOS effects; and univariate analyses, in the form of
chi-square tests, for those variables entered into the regression equation
with a significant F value at the time of first entry.
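For readers unfamiliar with the univariate step, the Pearson chi-square statistic for a contingency table can be computed as sketched below. The 2 x 2 counts are hypothetical and serve only to show the calculation; they are not data from the study. The critical values are the standard .05-level values of the chi-square distribution.

```python
# Standard chi-square critical values at the .05 level, keyed by degrees of freedom.
CRITICAL_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}

def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts: rows are two groups, columns are "has award" yes/no.
hypothetical = [[30, 70], [50, 50]]
stat, df = chi_square(hypothetical)
significant = stat > CRITICAL_05[df]
print(round(stat, 3), df, significant)
```

In the analyses described above, a test of this kind was run only for variables that had already shown a significant F value on first entry into the regression.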

The findings from this analysis are summarized in Table III.36. The
asterisked cells indicate which of the other available variables (Post, MOS,
AFQT, Moral Waiver, Sex, and Race) were significantly related to each of the
administrative measures in both the univariate and multivariate analyses.
The pattern of significant and nonsignificant relationships found was
encouraging.

First, there was no evidence that a soldier's race was a significant
determiner of his/her Reenlistment Eligibility, Number of Awards, or any
other of the Army-wide administrative measures. Second, although a soldier's
sex was related to Awards (males received more) and to Letter/Certificate
(females received more), when the two variables were combined into the
Letter/Certificate/Award measure, sex differentials were no longer
statistically significant.

Third, AFQT score or mental category was related to successfully
completing Military Training Courses and to Number of Awards, indicating the
possible usefulness of the ASVAB in predicting aspects of Army-wide
performance. Fourth, both Reenlistment Eligibility and Promotion Rate, which
may be related to noncognitive as well as cognitive factors, do not appear to
be dependent on the soldier's location (Post), MOS, or demographic group
(i.e., these measures seem to be fairly even-handedly administered Army-wide).

Finally, there are distinct MOS and post differences in average scores
for most of the measures. For example, Administrative Specialists (71L)
received more Letters/Certificates and Infantrymen (11B) more Awards than
soldiers in other MOS. Soldiers at one of the five posts visited received
more letters, certificates, and awards, and more extra training than soldiers
at the other posts. Care will have to be exercised in pooling performance
measurement data across MOS and posts to try to weed out sources of criterion
contamination (e.g., differences in local filing practices) while maintaining
valid distinctions.
Table III.36

Summary of Univariateᵃ and Multivariateᵇ Analyses of
Administrative Variables

Administrative Measure
Reenlistment Eligibility
Letter/Certificate
Awards
Letter/Certificate/Award
Military Training Courses
Article 15/FLAG Action
Promotion Rate

Post  MOS  AFQT  Moral Waiver  Sex  Race


★  * 

★  *  * 

*  *  * 

★  *  * 


*  * 

*  * 

★ 
* 


* 


* 


* p < .05, in both univariate and multivariate analyses. In the
multivariate analysis the significance level refers to the F value obtained
when the variable was first entered into the prediction equation (see
footnote b for order of entry).

ᵃUnivariate analyses consisted of chi-square tests and, where
appropriate, analyses of variance.

ᵇMultivariate analyses consisted of stepwise multiple regressions. Control
variables, consisting of four dichotomous Post variables and four
dichotomous MOS variables, were entered first, followed by AFQT, Moral
Waiver, Sex, and Race, in turn.


Criterion  Field  Test:  Self-Reports  of  Administrative  Actions 

While the use of administrative measures is consonant with the Project A
multimethod approach to performance measurement, and while these indexes hold
promise as criteria of first-tour soldier performance and in-service
predictors of second-tour performance, it must be asked whether the effort
and expense of collecting these indexes from the 201 Files are justified by
the outcome. Also, while there was a high degree of correspondence between
information on the EMF computerized file and information collected from the
individual 201 Files, a number of the most promising variables are not
available from the EMF.

Accordingly, a self-report instrument, the Personnel File Information
Form, was developed and administered during the Batch A field testing. The
self-report information could then be compared to the information in actual
201 Files, obtained by the project team during the field test period.
Information on the field test results and subsequent modifications of the
administrative measures is contained in Section 15.


Section  8 


CRITERION FIELD TESTS: SAMPLE AND PROCEDURE¹


The initial development of the Project A criterion measures has been
described in Sections 2-7. These measures were revised on the basis of
experience from the criterion field tests. This section describes the
sample and procedures that were used in the field tests. Results from the
field tests of specific measures are reported in the sections that follow.

The  objectives  of  the  criterion  field  tests  were  to: 

• Provide item/scale analyses for the subsequent revision of the
criterion measures to be used in the major validation samples.

• Provide data on the reliabilities and factor structures of the
performance ratings, job sample measures, and job knowledge tests.

• Provide data to estimate the interrelationships among the major
kinds of criterion measures.

• Evaluate the data collection procedures for use in the subsequent
large-scale Concurrent Validation.


The  Sample 

The sample for the field tests was drawn from nine different jobs, or
Military Occupational Specialties (MOS), and from six different locations.
The nine jobs and their MOS designations--the now familiar Batch A and Batch
B--were as follows:

11B  Infantryman
13B  Cannon Crewman
19E  Armor Crewman
31C  Radio Teletype Operator
63B  Light Wheel Vehicle Mechanic
64C  Motor Transport Operator
71L  Administrative Specialist
91A  Medical Specialist
95B  Military Police

Tables III.37 and III.38 provide a breakdown of the criterion field test
sample sizes by MOS and location, and by race and sex, respectively.
USAREUR refers to the data collection site just outside Frankfurt, Germany.


¹This section is based primarily on a paper, Criterion Reduction and
Combination via a Participative Decision-Making Panel, by John P. Campbell
and James H. Harris, in an ARI Research Note (in preparation) which
supplements this Annual Report.


Table III.37

Field Test Sample Soldiers by MOS and Location

                                        MOS
Location       11B   13B   19E   31C   63B   64C   71L   91A   95B   Total
Fort Hood       --    --    --    --    --    --    48    --    42      90
Fort Lewis      29    --    30    16    13    --    --    24    --     112
Fort Polk       30    --    31    26    26    --    60    30    42     245
Fort Riley      30    --    24    26    29    --    21    34    30     194
Fort Stewart    31    --    30    23    27    --    --    21    --     132
USAREUR         58   150    57    57    61   155    --    58    --     596
Total          178   150   172   148   156   155   129   167   114   1,369


Table III.38

Field Test Sample Soldiers by Sex and Race

                     Sex
Race         Male    Female    Total
Black         330      58        388
Hispanic       37       3         40
White         789     104        893
Other          43       5         48
Total       1,199     170      1,369


The  Criterion  Measures 


As described in the earlier sections, the general procedure for criterion
development in Project A was to follow a basic cycle of a comprehensive
literature review, conceptual development, test and scale construction, pilot
testing, test and scale revision, field testing, and Proponent (management)
review.


Criterion  Measurement  Goals 


The primary goals of criterion measurement in Project A were to (a) make
a state-of-the-art attempt to develop job sample or hands-on measures of job
task proficiency, (b) compare hands-on measurement to paper-and-pencil tests
and rating measures of proficiency on the same tasks (i.e., a multitrait,
multimethod approach), (c) develop rating scale measures of performance
factors that are common to all first-tour enlisted MOS (Army-wide measures),
(d) develop standardized measures of training achievement to determine the
relationship between training performance and job performance, and
(e) evaluate existing archival and administrative records as possible
indicators of job performance.

The overall criterion development effort focused on three major methods:
hands-on samples, multiple-choice knowledge tests, and ratings. The
behaviorally anchored rating scale (BARS) procedure was extensively used in
developing the rating methods.


Field  Test  Criterion  Battery 

The complete array of specific criterion measures used at the field test
sites is given below. Again, the distinction between MOS-specific and
Army-wide is that the Army-wide measures are the same across all MOS; that is,
the same questionnaire or the same rating scale is used for all examinees.
The content of the MOS-specific measures, regardless of whether they are job
samples, knowledge tests, or ratings, is specific to a particular job and is
based on the task content of that job. Also, the judgment (i.e., rating) of
"NCO potential" refers to a first-tour enlisted soldier's potential, assuming
the individual would reenlist, for being an effective noncommissioned
officer, with supervisory responsibilities, during the second tour of duty.

A.  MOS-Specific  Performance  Measures 

1) Paper-and-pencil tests of achievement during training, consisting
of job-relevant knowledge tests of 100 to 200 items per
MOS.

-  Individual  item  scores 

-  Mean  test  scores 

2) Paper-and-pencil tests of knowledge of task procedures consisting
of an average of about nine items for each of 30 major
tasks for each MOS. Item scores can be aggregated in at least
four ways.


-  Sum  of  item  scores  for  each  of  the  30  tasks. 

-  Total  score  for  15  tasks  also  measured  hands-on. 

-  Total  score  for  15  tasks  not  measured  hands-on. 

-  Total  score  on  all  30  tasks. 

3) Hands-on measures of proficiency on tasks for each MOS,
measured on 15 tasks selected from the 30 tasks measured with
the paper-and-pencil test.

- Individual task scores.

- Total score for all 15 tasks.

4) Ratings of performance, using a 7-point scale, on each of the
15 tasks measured via hands-on methods by:

-  Supervisors 

-  Peers 

-  Self 

5) Behaviorally anchored rating scales of 6-12 performance dimensions
for each MOS by:

-  Supervisors 

-  Peers 

-  Self 

6)  A  general  rating  of  overall  MOS  task  performance  by: 

-  Supervisors 

-  Peers 

-  Self 

7) A job history questionnaire administered to incumbents to
determine the frequency and recency of task performance on the
30 tasks being measured.
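The four ways of aggregating the task-knowledge item scores listed under item 2 can be sketched as below. The task identifiers, the 0/1 item scoring, and the example data are assumptions made for illustration; the report does not describe the scoring files.

```python
def aggregate_knowledge_scores(item_scores, hands_on_tasks):
    """Aggregate item scores per task and across task subsets.

    item_scores: dict mapping a task id to a list of item scores (1 = correct).
    hands_on_tasks: set of task ids that were also measured hands-on.
    """
    task_sums = {task: sum(items) for task, items in item_scores.items()}
    hands_on_total = sum(s for t, s in task_sums.items() if t in hands_on_tasks)
    other_total = sum(s for t, s in task_sums.items() if t not in hands_on_tasks)
    return {
        "task_sums": task_sums,                  # one sum per task
        "hands_on_total": hands_on_total,        # tasks also measured hands-on
        "not_hands_on_total": other_total,       # tasks not measured hands-on
        "total": hands_on_total + other_total,   # all tasks
    }

# Hypothetical four-task example (the field tests used 30 tasks per MOS).
scores = {"T1": [1, 0, 1], "T2": [1, 1], "T3": [0, 0, 1], "T4": [1]}
result = aggregate_knowledge_scores(scores, hands_on_tasks={"T1", "T3"})
print(result)
```

Keeping the hands-on and non-hands-on subtotals separate allows the knowledge test to be compared with the hands-on measure on exactly the same tasks.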

B. Army-Wide Measures

1) Eleven behaviorally anchored rating scales designed to assess
the dimensions listed below. Three sets of ratings (i.e.,
from supervisors, peers, and self) were obtained on each scale
for each individual.

- Technical Knowledge/Skill

-  Initiative/Effort 

-  Following  Regulations/Orders 

-  Integrity 

-  Leading  and  Supporting 

-  Maintaining  Assigned  Equipment 

-  Maintaining  Living/Work  Areas 

-  Military  Appearance 

-  Physical  Fitness 

-  Self-Development 

-  Self-Control 
2) A rating of general overall effectiveness as a soldier by:

-  Supervisors 

-  Peers 

-  Self 

3)  A  rating  of  noncommissioned  officer  (NCO)  potential  by: 

-  Supervisors 

-  Peers 

-  Self 

4)  A  rating  of  performance  on  each  of  14  common  tasks  from  the 
Manual  of  Common  Tasks  by: 

-  Supervisors 

-  Peers 

-  Self 

5) A 77-item summated rating scale measure of expected combat
effectiveness.²

6) A 14-item self-report measure (the Personnel File Information
Form) of certain administrative indexes such as awards,
letters of commendation, and reenlistment eligibility.

7) The same administrative indexes taken from 201 Files (by
project staff).

8) The Environmental Questionnaire, a 44-item descriptive
questionnaire completed by both incumbents and supervisors for
the purpose of describing 14 factors pertaining to
organizational climate, structure, and practices.³

9) A Leader Behavior Questionnaire designed to permit incumbents
and supervisors to describe leadership policies and practices
in the unit.⁴

10)  A  Measurement  Method  Questionnaire  administered  at  the  end  of 
the  testing  sessions  to  obtain  soldiers'  reactions  to  the 
various  types  of  testing. 


²Administered only to MOS in Batch B at Fort Riley.

³See Olson, Borman, Robertson, and Rose (1984).

⁴See White, Gast, Sperling, and Rumsey (1984).


Procedure 


For the purpose of data collection in the field tests, the criterion
measures were divided into four major blocks corresponding to:

1. Hands-on (job sample) measures (HO).

2. Rating measures (R), both Army-wide and MOS-specific.

3. Paper-and-pencil measures of job knowledge (K5).

4. Paper-and-pencil measures of training achievement (K3).

Each block comprised one-half day of participant time and each participant
was tested for a 2-day period.

During  the  week  preceding  data  collection  at  each  research  site,  the 
scorers  for  the  hands-on  (job, sample)  measure  were  given  2  days  of  training 
on  scoring  procedures,  test  standardization,  and  the  overall  design  and 
objectives  of  Project  A, 


Advance  Preparation  on  Site 

This activity required approximately 3 days per test site for:

• Briefings to Commanders of the units supplying the troops to
clarify the test objectives, activities, and requirements.

• Examination of the test site, equipment, supplies, and special
requirements for the data collection and set-up of the hands-on
test stations.

• Training of the test administrators and scorers.

• A "dry run" of the test procedures.

An officer and two NCOs from one of the supporting units were assigned
to support the field test. The officer provided liaison between the data
collection team and the tested units; the NCOs coordinated the flow of
equipment and personnel through the data collection procedures. Each test
site had a test site manager (TSM) who supervised all of the research
activity and maintained the orderly flow of personnel through the data
collection points.

The logistics plan and test schedule were reviewed with the unit's
administrative staff, and civilian and military scorers and other data
personnel were trained. In the training phase, a dry run of the procedures
followed the data collection schedule and used the personnel and locations
designated for the test. The training focused on the handling of problem
situations, particularly those requiring remediation by the scientific staff.

Training for scorers for the hands-on measures for each MOS was conducted
by two project staff members. After an orientation session, staff members
reviewed five HO tasks with the scorers by describing the equipment/material
requirements, the procedures for setting up testing stations, and the
specific instructions for administering and scoring each HO test. The
scorers then alternated evaluating each other performing the tasks; this
provided experience both in administering the HO tests and in scoring the
performance measures of each. Project staff coached the "performers" to make
unusual, as well as common, incorrect actions in order to give scorers
practice in detecting and recording errors. The above procedure also
identified any steps where local standard operating procedures (SOP) differed
from the test; allowances for such differences were made in the test
instructions. The second day of training was devoted to a dry run of the test
procedures, with all scorers simultaneously evaluating a staff member
performing a task. Problems arising from the instructions, test procedures,
or task steps were identified and corrected.


Administration  of  the  Measures 


The administration schedule for a typical site (Fort Stewart, Georgia)
is shown in Figure III.11. The field test proceeded as follows: Thirty MOS
31C and 30 MOS 19E soldiers arrived at the test site Thursday, 21 February
1985 at 0745. Each MOS was divided randomly into two groups of 15 soldiers
each, identified as Groups A, B, C, or D. Each group was directed to the
appropriate area to begin the administration appropriate for that group.
They rotated under the direction of the test site manager through the
scheduled areas according to the schedule shown in Figure III.11. The
sequence was repeated for 30 MOS 91A and 30 MOS 63B soldiers beginning Monday
(25 Feb 85) and for 30 MOS 11B soldiers on Wednesday (27 Feb 85). The order
of administration of the measures was counterbalanced among the groups.

Before any instruments were administered to any soldier, each was asked
to read a Privacy Act Statement, DA Form 4368-R. The Background Information
and Job History forms were then administered, with 30 minutes allowed for
completion.

Administration of Job Samples (15 tasks measured hands-on). Depending
on the task being measured, the location was outside (e.g., vehicle
maintenance, weapons cleaning) or inside (e.g., measure distance on a map).
Scorers assigned to each test station ensured that the required equipment was
on hand and the station was set up correctly, and followed the procedures for
administering and scoring the tests. As each soldier entered the test
station, the scorer read the instructions aloud and began the measure. The
length of time a soldier was at the test station depended on both the
individual's speed of performance and the complexity of the task.




MOS-Specific Job Knowledge Tests (30 tasks, half of them also in HO
testing). The MOS-specific knowledge tests are grouped into four booklets of
about seven or eight tasks per booklet. Each booklet took about 45 minutes
to complete. The order of the booklets and the order of the tasks in each
booklet were rotated. There was a 10-15 minute "smoke and stretch" break
between booklets. The purpose of the grouping into booklets was to try to
control the effects of fatigue and waning interest.

Training Achievement Tests. The training knowledge test for each MOS
was in three booklets. The sequence of the booklets was alternated so that
soldiers sitting next to each other had different booklets. Again, the
                          MOS 31C     MOS 19E     MOS 91A     MOS 63B     MOS 11B
Group*                    A     B     C     D     E     F     G     H     I     J

Tuesday    19 Feb 85      Scorer Training (All Scorers)
Wednesday  20 Feb 85      Scorer Training (All Scorers)

Thursday   21 Feb 85  AM  PH    PK5   PK5   PK3
                      PM  K3    H     R     R
Friday     22 Feb 85  AM  RS    K3S   H     K5
                      PM  MK5   MR    MK3S  MHS

Monday     25 Feb 85  AM                          PH    PK5   PK5   PK3
                      PM                          K3    H     R     R
Tuesday    26 Feb 85  AM                          RS    K3S   H     K5
                      PM                          MK5   MR    MK3S  MHS

Wednesday  27 Feb 85  AM                                                  PH    PK5
                      PM                                                  K5    H
Thursday   28 Feb 85  AM                                                  K3S   R
                      PM                                                  MR    MK3S

* Each group equals 15 soldiers in the same MOS.

Code:  P      = Personal and Job History forms
       K3, K5 = Task 3 or Task 5 Knowledge Measures
       H      = Hands-on Measures
       R      = Peer Ratings
       S      = Supervisor (rater and endorser) Ratings
       M      = Measurement Method Questionnaire


Figure III.11. Typical field test administration schedule.
purpose of using booklets was to try to control the effects of fatigue and
waning interest. Soldiers had 45 minutes for each booklet and a 10-15 minute
"smoke and stretch" break between booklets.

Rating Scales. The supervisory, peer, and self ratings are designed
around "rating units." Each rating unit consists of the individual soldier
to be evaluated, four identifiable peers, and two identifiable supervisors.
A peer is defined as an individual soldier who has been in the unit for at
least 2 months and has observed the ratee's job performance on several
occasions. A supervisor is defined as the individual's first or second line
supervisor (normally his rater and endorser).

The procedure for assigning ratees to raters (both peers and supervisors)
consists of two major steps: (a) a screening step that determines which
raters could potentially rate which ratees; and (b) a computerized random
assignment procedure that assigns raters to ratees within the constraints
that 1) the rater indicated he/she could rate the ratee; 2) ratees with few
potential raters are given priority in the randomized assignment process;
3) the number of ratees assigned is equalized as much as possible across
raters; and 4) the number of ratees any given rater is asked to rate does
not exceed a preset maximum.
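
The two-step assignment described above can be sketched in code. The following is only an illustration under stated assumptions, not the program used in the field tests: the function name `assign_raters`, the parameters `raters_per_ratee` and `max_per_rater`, and the greedy least-loaded selection rule are all hypothetical choices made to satisfy the four constraints.

```python
import random

def assign_raters(can_rate, raters_per_ratee, max_per_rater, seed=0):
    """Randomly assign raters to ratees (hypothetical sketch).

    can_rate maps each rater to the set of ratees that the rater
    indicated he/she could rate (the output of the screening step).
    """
    rng = random.Random(seed)
    # Invert the screening result: ratee -> list of potential raters.
    potential = {}
    for rater in sorted(can_rate):
        for ratee in sorted(can_rate[rater]):
            potential.setdefault(ratee, []).append(rater)
    load = {rater: 0 for rater in can_rate}
    assignment = {}
    # Constraint 2: ratees with the fewest potential raters go first
    # (ties are broken at random).
    order = sorted(potential, key=lambda r: (len(potential[r]), rng.random()))
    for ratee in order:
        chosen = []
        for _ in range(raters_per_ratee):
            # Constraints 1 and 4: only raters who said they could rate
            # this ratee and who are below the preset maximum.
            open_raters = [r for r in potential[ratee]
                           if r not in chosen and load[r] < max_per_rater]
            if not open_raters:
                break
            # Constraint 3: pick at random among the least-loaded raters.
            lowest = min(load[r] for r in open_raters)
            pick = rng.choice([r for r in open_raters if load[r] == lowest])
            chosen.append(pick)
            load[pick] += 1
        assignment[ratee] = chosen
    return assignment, load
```

A ratee may end up with fewer raters than requested when too few eligible raters remain; a production procedure would need an explicit rule for that case.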

The potential raters were given an alphabetized list of the ratees.
They were told the purpose of the ratings within the context of the research,
and the criteria (e.g., minimum length of period of working together) they
should use in deciding who they could rate. They were told the maximum
number of people they would be asked to rate and that assignments of ratees
to raters would be made randomly. The importance of their careful and
comprehensive examination of the list of ratees was emphasized.

The rating scale administrator, using the training guide, then discussed
the content of each effectiveness category, and urged raters to avoid common
rating errors. A major thrust of this training was an attempt to standardize
the rating task for the raters. With the lack of control to be expected, an
important concern was that all raters face the same (or a very similar)
rating task. A serious potential confounding involves rating unit and
administrator; lower average ratings for some rating units might be a result
of different sets (i.e., "rate more severely") provided by administrators
handling those rating units rather than true performance deficiencies.
Standardization of the administration helps reduce this potential problem. A
second major thrust of the rater training was to minimize the amount of
reading the raters had to do. This was, as much as possible, an oral
administration; the rating program was not dependent on raters' reading large
amounts of material.


Planned  Analysis 

The  general  analytic  steps  were  straightforward  and  consisted  of  the 
following: 


1. Item analysis for each job knowledge test for each MOS.

2. Item analysis for the training achievement tests for each MOS. An
analysis of item responses was done for a sample of 50 trainees as
well as for the incumbent samples in the field tests.

3. An item analysis summary table for each knowledge test for each
MOS. The table for each MOS summarized item discrimination
indexes, item difficulties, and the frequency of items that were
flagged for various kinds of potential keying errors (e.g.,
negative correlation with total score, high frequency of response
for incorrect answer).

4. An item (where task = item) analysis for each hands-on (job sample)
test.

5. Frequency distribution and scale statistics for each rating scale
for each MOS.

6. Interrater reliabilities for the individual rating scales.

7. Split-half correlations (Spearman-Brown estimates) for the
knowledge tests and hands-on measures, test-retest coefficients for
the hands-on measures, and internal consistency indexes where
applicable.

8. A complete intercorrelation matrix of all the criterion variables
for each MOS down to the scale score and task score level (i.e.,
the matrix included all the variables listed in the previous
sections).

9. A set of reduced intercorrelation matrixes that included subsets
of the total array of variables.

10. Factor analyses for selected matrixes, primarily those having to do
with the rating scale measures.

11. For a selected number of variable pairs, correction of the
intercorrelation for attenuation in an attempt to estimate the
correlation between the true scores.
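
Two of the steps above (7 and 11) rest on standard psychometric formulas that are short enough to state as code. The sketch below assumes the usual textbook forms of the Spearman-Brown step-up for a test split into two halves and of the correction for attenuation; the function names and the numeric values in the usage note are illustrative only and are not taken from the field test data.

```python
import math

def spearman_brown(r_half):
    """Step up a split-half correlation to a full-length reliability estimate."""
    return 2 * r_half / (1 + r_half)

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the true-score correlation between two measures.

    r_xy is the observed correlation; r_xx and r_yy are the
    reliabilities of the two measures.
    """
    return r_xy / math.sqrt(r_xx * r_yy)
```

For example, a split-half correlation of .60 steps up to .75, and an observed correlation of .42 between measures with reliabilities of .81 and .64 corresponds to an estimated true-score correlation of about .58.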


Interpretation  and  Use  of  the  Field  Test  Results 

The results of the above analyses were prepared in a master data book
for each MOS. Each data book contained item and scale analyses,
intercorrelations down to the scale and subscale level, and factor analyses
of selected data sets.

These data were then carefully scrutinized by a designated criterion
analysis group. The group included the principal investigator for each of
the criterion measures; consequently, for each variable there was at least
one committee member with a strong vested interest. The other members of the
committee consisted of the principal scientist for the project, the Army
Research Institute's chief scientist for the project, and one hapless
individual (the assistant project director) who had to serve as chair: 10
people in all.

The objectives of the group were to review the results of the field
tests and to agree upon the specific revisions that were to be made in each
criterion measure before the criterion array was declared the set of
criterion measures that would be used for the Concurrent Validation. The
mode of operation was for the principal investigator responsible for each
criterion to review carefully the relevant field test data and propose
specific revisions, additions, or deletions aimed at maximizing the
reliability, acceptability, and construct validity of the measure. A general
discussion then followed, continuing until the investigator's proposal was
accepted or a consensus was reached on what specific changes should be made.

The obvious disadvantage of the committee approach to data
interpretation is the time involved. More than once the membership wished
for a good dose of totalitarian power. On the positive side, all the major
benefits of participative decision making seemed to manifest themselves.
Everyone concerned always knew what was being done, crucial issues tended
not to get lost, investigators could exercise veto power if the integrity of
their product was being threatened, and considerable commitment seemed to
have been generated. On balance, the time investment seemed worth it. In
truth, on such a large, multifaceted project it probably is not possible for
one "expert" to make these decisions unilaterally. If the Project A model is
used in the future with any frequency, applied psychologists must learn how
to "manage" data interpretation as well as data collection.

The following sections summarize the major findings generated by the
above analyses and outline the revisions made in the performance measures as
a result. Results pertaining to the self ratings are not included in these
summaries; initial analyses indicated that the self ratings suffered from
relatively more halo, central tendency, and leniency error than did
supervisor and peer ratings, and self ratings were not considered further.


Section 9

FIELD  TEST  RESULTS:  TRAINING  ACHIEVEMENT  TESTS1 


Descriptive  Statistics  for  Field  Tests 


The descriptive statistics for the training achievement tests
administered in the field tests to job incumbents are shown in Table III.39.
Test scores are based on the items judged relevant for the job or for the job
and for training content. Those few items judged relevant only for training
content (see Table III.10) are not included because the respondents being
tested were job incumbents rather than trainees.


These data are for Batches A and B only (nine MOS) since Batch Z MOS
were not field tested. Mean values for the previous trainee figures based on
19 MOS (Section 2) have been recomputed including only the nine MOS that
participated in the field tests; trainee and field test job incumbent results
match closely. Mean trainee alpha for the nine MOS was .882, and mean
incumbent alpha is .877. Mean percent correct for trainees was 53.9%,
compared to 54.5% for job incumbents.

Revisions to Training Achievement Tests
Reduction in Number of Items for Concurrent Validation


Because of time constraints, the length for the Concurrent Validation
versions of the training tests would be limited to approximately 150 items.
To reduce the size of the item pool, any item that had been rated not
relevant to the job and also not relevant to training was dropped first. To
reduce test length further, items were dropped that had been rated lowest in
importance and/or highest in difficulty. Because the training performance
domain was assumed to be multidimensional, items were not usually eliminated
solely because of a negative biserial correlation with the total test score.
However, some items were dropped that exhibited the three characteristics of
(a) low pass rate, (b) negative biserial, and (c) a distractor or distractors
with a high positive biserial. During the revision of the item pools, the
relative frequency of items in each job task duty area was maintained as it
had been previously.
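
The three-part screening rule described above (low pass rate, negative item-total correlation, and a distractor with a high positive correlation) can be illustrated with a short sketch. It is not the analysis program used for the project: it substitutes the point-biserial (a Pearson correlation with a 0/1 indicator) for the biserial, correlates each item with an uncorrected total score, and uses cutoff values (`low_pass`, `high_distractor`) that are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation; returns 0.0 when either variable is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)

def flag_item(choices, key, totals, low_pass=0.25, high_distractor=0.20):
    """Screen one multiple-choice item.

    choices: option selected by each examinee; key: the keyed answer;
    totals: each examinee's total test score (assumed input layout).
    """
    correct = [1 if c == key else 0 for c in choices]
    pass_rate = sum(correct) / len(correct)
    r_key = pearson(correct, totals)
    # Correlation of each distractor (chosen vs. not chosen) with total score.
    r_distractors = {
        opt: pearson([1 if c == opt else 0 for c in choices], totals)
        for opt in sorted(set(choices) - {key})
    }
    flagged = (pass_rate < low_pass
               and r_key < 0
               and any(r > high_distractor for r in r_distractors.values()))
    return {"pass_rate": pass_rate, "r_key": r_key,
            "r_distractors": r_distractors, "flagged": flagged}
```

An item flagged in this way is a candidate for a keying check before it is dropped, since a popular distractor chosen mainly by high scorers often signals a miskeyed answer.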

Tables III.40, III.41, and III.42 report the number of items remaining
on each test after the revisions had been made. The versions to be used for
the Concurrent Validation contained the number of items shown in the columns
on the far right. The tables for Batches A and B differ slightly from the
table for Batch Z because many of the Batch A and B item reductions were made
using field test data, which are not available for Batch Z.

1Development of the training achievement tests was described in Section 2,
Part III. Section 9 is based primarily on ARI Technical Report 757,
Development and Field Test of Job-Relevant Knowledge Tests for Selected MOS,
by Robert H. Davis, Gregory A. Davis, John N. Joyner, and Maria Veronica de
Vera, and the supplementary ARI Research Note in preparation, which contains
the report appendixes.
Table III.39

Results From Training Achievement Field Tests Administered to Incumbents


         Number     Number    Mean                             Mean
         of         of        Number                           Percent
MOS      Subjects   Items     Correct    SD     Range   Alpha  Correct

Batch A

13B      149        133       49.2       16.5   74      .90    44.5
64C      155        137       70.3       17.2   75      .91    51.3
71L      129        97        50.5       9.9    51      .83    52.1
95B      112        131       77.3       10.2   51      .76    59.0

Batch B

11B      166        162       86.4       20.0   98      .93    53.3
19E      169        193       112.9      21.0   142     .93    58.5
31C      143        176       99.6       20.1   120     .92    55.6
63B      155        205       106.9      19.4   107     .90    52.1
91A      155        115       72.9       10.3   76      .82    63.4

Review  by  TRADOC  Proponent  Agencies 

Before being administered to job incumbents as part of the Concurrent
Validation, each item pool was submitted to the appropriate TRADOC Proponent
for review. The number of items sent out for review and the number of items
eliminated, added, or modified as a result of this review are also summarized
in Tables III.40, III.41, and III.42.


Comparison of Initial Item Pool and Concurrent Validation Version

When initial item pool and Concurrent Validation versions are compared,
there is a small increase in the percentage of items rated very important and
a small decrease in the proportion of items rated of little importance on
both the combat scenario (Very Important, 33.1 to 34.0%; Of Little
Importance, 22.8 to 20.6%) and the garrison scenario (Very Important, 43.1 to
46.5%; Of Little Importance, 11.2 to 8.3%). These changes are all in the
expected direction, given the procedures that were used to revise the initial
item pools.

Mean importance ratings across MOS for item pool and Concurrent
Validation versions of the tests for each scenario were also compared. All
changes are in the expected direction (i.e., higher importance on the
Concurrent Validation test version than on the item pool), and two are
significant when compared using the Wilcoxon Matched Pairs test: combat
scenario (Initial Item Pool vs. Concurrent Validation version), Z = 1.73,
p = .08; combat readiness scenario (Initial Item Pool vs. Concurrent
Validation version), Z = 2.01, p = .04; garrison scenario, Z = 2.36,
p = .004.
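
For readers unfamiliar with the statistic, the following sketch shows one common way a Wilcoxon matched-pairs Z can be obtained: rank the absolute paired differences, sum the ranks of the positive differences, and refer that sum to its normal approximation. Whether the report used this exact form (for example, one- or two-tailed probabilities, or a tie correction) is not stated, so the function `wilcoxon_z` and its two-tailed p value are assumptions for illustration.

```python
import math

def wilcoxon_z(before, after):
    """Wilcoxon matched-pairs signed-ranks test, normal approximation.

    Zero differences are discarded; tied absolute differences receive
    the mean of the ranks they span. No tie correction is applied to
    the variance. Returns (z, two-tailed p).
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j over a run of tied absolute differences.
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w_plus - mean_w) / sd_w
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_tailed
```

Here `before` and `after` would hold, for each MOS, the mean importance rating on the initial item pool and on the Concurrent Validation version, respectively.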
No budget balancing was needed for Batch A tests.

There  were  163  items  common  between  the  SP  A  T  versions 
67  unique  to  T. 


Table III.41

Reduced to this number to eliminate time-consuming math items.


For the version of the tests administered as part of the Concurrent
Validation, the distribution across relevance categories is nearly the same
as for the original item pool.


Some  Lessons  Learned 


Since this was such a large-scale test development effort conducted over
a relatively short period of time, a number of things were learned in
addition to the psychometric properties of the scales. We summarize a few of
these below.


Item  Tracking 

Developing more than 200 test items for each of the 20 different MOS
required keeping track of data on more than 4,000 test items, through several
revisions. To do this, each draft item was assigned a master number, and a
large table was constructed for each MOS showing, for each version of the
item pool (version shown to incumbents, version used in the field test,
etc.), the test booklet number of that item and the item in its revised
form. Into these same tables were entered the AOSP (job task) duty area for
the item, whether the item was judged relevant to training and to the job,
whether the item was modified or dropped from the pool, and so forth.

Tracking items is further complicated by the fact that as items are
reviewed, many are changed significantly. Judgments regarding relevance and
importance refer, of course, to a particular item at a given point in time.
After each item change, a judgment must be made as to whether or not the item
is still the "same" item. If not, then the original item must be recorded as
dropped, and a new item with a new master number (linked to the old) must be
entered into the item pool. An automated database program running on a small
(or large) computer appears necessary in an effort of this magnitude.
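
A minimal sketch of such a tracking record is given below. It is not a description of any system used on the project; the class names `ItemRecord` and `ItemPool`, the field names, and the status labels are hypothetical, chosen only to mirror the bookkeeping just described (a master number per item, one entry per item-pool version, and a link from a replacement item back to the item it superseded).

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class ItemRecord:
    master_no: int
    mos: str
    duty_area: str                       # AOSP (job task) duty area
    relevant_to_training: bool
    relevant_to_job: bool
    status: str = "active"               # "active" or "dropped"
    derived_from: Optional[int] = None   # master number of the replaced item
    versions: Dict[str, dict] = field(default_factory=dict)

class ItemPool:
    def __init__(self):
        self.items: Dict[int, ItemRecord] = {}
        self._next_no = 1

    def add(self, mos, duty_area, relevant_to_training, relevant_to_job,
            derived_from=None):
        """Enter a new item under the next free master number."""
        rec = ItemRecord(self._next_no, mos, duty_area,
                         relevant_to_training, relevant_to_job,
                         derived_from=derived_from)
        self.items[rec.master_no] = rec
        self._next_no += 1
        return rec

    def record_version(self, master_no, version, booklet_no, text):
        """Store the booklet number and wording of an item in one pool version."""
        self.items[master_no].versions[version] = {
            "booklet_no": booklet_no, "text": text}

    def replace(self, master_no):
        """Item is no longer the 'same' item: drop it and add a linked new one."""
        old = self.items[master_no]
        old.status = "dropped"
        return self.add(old.mos, old.duty_area, old.relevant_to_training,
                        old.relevant_to_job, derived_from=old.master_no)
```

Keeping the link in `derived_from` preserves the history of an item across revisions, so that earlier relevance and importance judgments can still be traced to the wording they actually referred to.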


Evolution  of  Item  Budgets 

Item budgets were originally developed to help assure that the content
domain for each test would be clear, representative, and relevant. Such
budgets also serve the important functions of guiding and providing
discipline to item writers who often do not understand the psychometric
issues involved in test construction.

Since the original pool of items was larger than needed for the tests,
it was possible to keep reworking the budgets, to ensure that the content
domain was appropriately sampled. The important point to note here is that
the original budgets were a starting point in the test development process.
The SME and Proponent reviews provided important insights to ensure that the
training achievement tests were indeed content valid.

A reasonable way to track budgets is to set up spreadsheets that
forecast the number of items needed in specific content areas as the item
pools evolve into actual tests.
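
The forecasting step can be illustrated with a small calculation. The sketch below assumes one plausible rule, proportional allocation with largest-remainder rounding, for turning the number of items per content area in the pool into a budget for a shorter test; the function name `item_budget` and the example counts are hypothetical and are not the project's actual budgets.

```python
def item_budget(pool_counts, target_total):
    """Allocate target_total items across content areas in proportion
    to the number of items each area has in the pool."""
    total = sum(pool_counts.values())
    exact = {area: n * target_total / total for area, n in pool_counts.items()}
    # Start from the whole-number part of each proportional share.
    budget = {area: int(x) for area, x in exact.items()}
    leftover = target_total - sum(budget.values())
    # Hand the remaining items to the areas with the largest remainders.
    by_remainder = sorted(exact, key=lambda a: exact[a] - budget[a],
                          reverse=True)
    for area in by_remainder[:leftover]:
        budget[area] += 1
    return budget
```

For instance, a pool of 90, 70, and 45 items in three duty areas reduced to a 150-item test gives budgets of 66, 51, and 33 items, which keeps the relative frequency of items per area nearly unchanged.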


Summary  and  Discussion 


The major objective for this part of Project A's criterion development
was to create content-valid and reliable training achievement tests for
measuring the cognitive component of training success. How successful were
these efforts?

The tests in Batches A and B may be evaluated from three perspectives.
First, since content validity is so crucial, one can examine the process by
which the tests were developed and use some of the standards identified by
Guion (1977) and others as criteria for evaluating that process. Second, one
can consider the development process up to the point of the final Proponent
review, which indeed was an added step in the process, and compare the tests
before and after Proponent review. The assumption here is that if the tests
undergo relatively little change (particularly fundamental change such as
cutting items and/or adding new items) as a result of the final Proponent
review, the development process as originally conceived was valid. Finally,
one can look at descriptive psychometric indexes such as reliability and item
distributional characteristics.

The development process did conform to the three criteria of domain
definition, content representativeness, and content relevance. First, the
domain was operationally identified and items were drawn from that content.
The developmental model prescribed that the initial items would be drawn from
published Army literature. Since the published literature inevitably lags
behind practice (i.e., doctrine and equipment), some change was inevitable as
subject matter experts examined items. Nevertheless, the changes were, in
most cases, not dramatic and many concerned terminology or phrasing rather
than content.

With respect to content representativeness, the proportions of items
assigned to different duty areas on different versions of the test are
similar. Inevitably, there were some modifications in the percentage of
items in a given duty area, but radical changes in the distribution of items
across duty areas were not necessary.

Elaborate procedures were used to determine content relevance. Items
judged by experts as being not relevant to training and/or the job were
eliminated. Moreover, relevance was judged in terms of importance; only
those items judged to be very important on one or more of the three scenarios
were retained in the item pool.

With respect to fairness, as procedures were being developed for review
of items by subject matter experts, guidelines were developed and implemented
to ensure that the groups reviewing the items were balanced for race and
gender.

Next is the question of whether the Proponent review altered or changed
the tests. The short answer is: With one or two exceptions, not very much.
Proponents requested three types of changes. The mean percentages of these
changes across all 19 MOS were as follows: cuts, 7.5%; additions, 1.4%; and
modifications, 9.4%. When one considers the lengths of the tests, these
percentages are not very great. Furthermore, modifications were in many
cases relatively trivial and did not concern content so much as format or
phrasing. The distributions of these changes were, however, quite skewed.
By consulting Tables III.40, III.41, and III.42, one can note that the most
significant disagreements occurred for MOS 16B (cuts), 54E (cuts), 11B
(cuts), and 63B (modifications). Items were added to these tests to
rebalance their respective item budgets.

Finally, the tests can be evaluated in terms of more traditional
psychometric measurements, particularly reliability. All of the tests have
high reliability coefficients and reasonable distributional properties. In
total they appear to possess considerable content validity for their intended
purpose.
Section 10

FIELD  TEST  RESULTS:  TASK-BASED  MOS-SPECIFIC  CRITERION  MEASURES* 


The analyses of the field test data for the task-specific performance
measures used three general kinds of information. First, extensive
item/scale analyses, including the calculation of reliability indexes, were
carried out on each measure. Second, the intercorrelations among the
different measures were examined. Finally, consideration was given to SME
and staff judgments on the relative suitability of job sample vs.
paper-and-pencil testing for assessing specific task performance.


Item/Scale  Analyses 

The basic psychometric properties of each measure are described in turn
for the job knowledge tests, the job sample or hands-on tests, the task
performance rating scales, and the Job History Questionnaire.


Job Knowledge Tests

The output generated by the item analysis procedure for the knowledge
tests included, for each item, the number and percentage who selected each
alternative, and for each item alternative, the Brogden-Clemans item-total
correlation (in which the total score represents the sum of all the items
used to assess knowledge of a specific task, excluding the item being
correlated with the total). Recall that there were about 30 task total
scores in each job knowledge test.

Although items with extremely high or low difficulties provide
relatively few discriminations, some such items might still be retained to
enhance test acceptability (e.g., because of the importance of the content)
or to preserve a measure of a task to be tested across several MOS. Also,
items with low item-total correlations might be deficient in some respect or
might simply be increasing test content heterogeneity. Since neither type of
information provided conclusive evidence regarding an item's utility, both
were applied in a judicious and cautious manner.

Those items that were particularly easy (more than 95% pass) or
particularly difficult (fewer than 10% pass), or that had low or negative
item-total correlations were examined first for keying errors or obvious
sources of cueing. Deficient items that could not be corrected were then


lDevelopment  of  these  measures  was  described  In  Section  3,  Part  III.  Section 
10  Is  primarily  based  on  ART  Technical  Report  717,  Development  and  Field  Test 
of  Task-Based  MOS - S pe c  1  f  1c  Criterion  Measures,  by  Charlotte  H.  Campbel 1 , 
Roy  Campbel  1  ,  Michael  (T  Rumsey ,  an<3  forothy  C.  Edwards,  and  the 
supplementary  ARI  Research  Note  In  preparation,  which  contains  the  report 
.appendixes . 


deleted,  and  the  Item  analysis  was  produced  again.  The  process  was 

iterative;  various  sets  of  Items  were  included  In  the  analysis,  and  the  set 
that  produced  the  highest  coefficient  alpha  for  the  entire  knowledge  task 
test  with  an  acceptable  pass  rate  (between  15%  and  90%)  was  retained. 
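As an illustration of the statistic that guided this iterative pruning, coefficient alpha and an "alpha if item deleted" check might be computed as below. This is a sketch under assumptions: the function names are hypothetical, population variances are used, and the actual sequence of item sets tried by the developers is not reproduced.

```python
from statistics import pvariance


def coefficient_alpha(responses):
    """Cronbach's coefficient alpha for a list of per-examinee item score lists."""
    k = len(responses[0])
    item_vars = [pvariance([person[i] for person in responses]) for i in range(k)]
    total_var = pvariance([sum(person) for person in responses])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def alpha_if_deleted(responses, item_index):
    """Alpha recomputed after removing one item from every examinee's record."""
    reduced = [
        [score for i, score in enumerate(person) if i != item_index]
        for person in responses
    ]
    return coefficient_alpha(reduced)


# Hypothetical 0/1 responses of four examinees to three items.
demo = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
alpha_all = coefficient_alpha(demo)
alpha_without_last = alpha_if_deleted(demo, 2)
```

In this small example, dropping the last item lowers alpha, so a rule based only on alpha would retain it; the text makes clear that content and acceptability judgments were weighed alongside such figures.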

Revisions were made on between 14 and 18% of the items in each MOS set, and
between 17 and 24% of the items were dropped. Following item deletions, the
distributions of items with regard to difficulty and item-total correlations
for each of the nine MOS were as summarized in Table III.43. The median
difficulty levels were 55 to 58% for five of the MOS, with the MOS 63B, 91A,
19E, and 95B tests having medians of 65 to 74%. Although some skew in item
difficulties was observed, it was not extreme.


Table III.43

Summary of Item Difficulties (Percent Passing) and Item-Total Correlations
for Knowledge Test Components in Nine MOS

                                      Number
MOS                                   of Items  Statistic        Mean  Median   Min    Max

13B  Cannon Crewman                     236     Difficulty (%)   59.2   55.5   13.4   97.2
                                                Item-Total (r)    .38    .38   -.06    .88

64C  Motor Transport Operator           166     Difficulty (%)   60.7   58.0    3.6   94.3
                                                Item-Total (r)    .31    .32   -.00    .91

71L  Administrative Specialist          170     Difficulty (%)   57.4   56.5    4.7   96.1
                                                Item-Total (r)    .30    .31   -.19    .84

95B  Military Police                    177     Difficulty (%)   66.4   74.0    0.0  100.0
                                                Item-Total (r)    .33    .32    .00    .82

11B  Infantryman                        228     Difficulty (%)   57.3   55.4    5.3   97.1
                                                Item-Total (r)    .30    .31   -.39    .88

19E  Armor Crewman                      205     Difficulty (%)   64.6   66.8   13.4   96.9
                                                Item-Total (r)    .32    .31   -.26    .95

31C  Single Channel Radio Operator      211     Difficulty (%)   58.0   57.1   11.3   95.4
                                                Item-Total (r)    .31    .31   -.09    .84

63B  Light Wheel Vehicle Mechanic       197     Difficulty (%)   65.1   64.5    7.8   97.4
                                                Item-Total (r)    .30    .30   -.13    .92

91A  Medical Specialist                 236     Difficulty (%)   66.9   69.0    8.6   98.7
                                                Item-Total (r)    .32    .32   -.25    .78


The item-total correlation distributions were also highly similar across the
nine MOS, with most items exhibiting correlations of .21 to .40 in each MOS.
Pruning items on the basis of low correlations was done very conservatively.
As a result, there remained in each knowledge test items with low or negative
correlations with the task total score. These ranged from [?]% of the items
in the MOS 13B (Cannon Crewman) tests to 29% of the items in the MOS 19E
(Armor Crewman) tests that had correlations below .20. Negative correlations
were found in no more than 8.0% of the items in any of the nine MOS. The
average of the item-total correlations in the various knowledge components
ranged from .30 to .38.

The means, standard deviations, and reliabilities for the total test score in
each MOS are shown in Table III.44. The reliabilities are split-half
coefficients, using 15 task tests in each half, corrected to a total length
of 30 task tests.
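A minimal sketch of a corrected split-half coefficient of this kind follows. The odd-even assignment of task tests to halves is an assumption for the example (the text does not say how the two sets of 15 knowledge task tests were formed), and the function names are hypothetical.

```python
from statistics import mean, pstdev


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))


def spearman_brown_doubled(r_half):
    """Step a half-test correlation up to the reliability of the full-length test."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(task_scores):
    """task_scores holds one list of task test scores per examinee.

    Alternate task tests are summed into two half scores (an assumed split),
    the halves are correlated, and the result is corrected to full length.
    """
    first_half = [sum(person[0::2]) for person in task_scores]
    second_half = [sum(person[1::2]) for person in task_scores]
    return spearman_brown_doubled(pearson(first_half, second_half))
```

With 30 task tests per examinee, each half contains 15 task tests, and the correction returns the estimate to the full 30-test length.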


Table III.44

Means, Standard Deviations, and Split-Half Reliabilities for
Knowledge Test Components for Nine MOS

                                        Mean     Standard    Split-Half
MOS                                     (%)      Deviation   Reliability^a

13B - Cannon Crewman                    58.9       12.6        .8[?]
64C - Motor Transport Operator          60.3       10.1        .79
71L - Administrative Specialist         55.8       10.4        .81
95B - Military Police                   66.4        9.2        .75
11B - Infantryman                       56.[?]     10.5        .[?]1
19E - Armor Crewman                     64.0       10.1        .90
31C - Single Channel Radio Operator     57.7        9.6        .84
63B - Light Wheel Vehicle Mechanic      64.4        9.1        .86
91A - Medical Specialist                69.8        8.1        .85

a Fifteen task tests in each half, corrected to a total length of 30 tests.


For all MOS, the majority of individual task test means were between about 35
and 85%; total score means were from 55 to 70%. The standard deviations were
also similar across the nine MOS, and although coefficient alphas were
variable across tasks, split-half reliabilities ranged from the .70s to the
.90s for total job knowledge score.

The reliabilities (coefficient alpha) of task tests appearing in multiple MOS
are shown in Table III.45. The magnitude of the coefficients is, for some
individual task tests, disappointing. However, a number of the subtests were
very short, containing no more than 3-5 items.

Table III.45

Coefficient Alpha of Knowledge Tests Appearing in Multiple MOS

Note: Each row lists a test's coefficients in left-to-right column order
(13B, 64C, 71L, 95B, 11B, 19E, 31C, 63B, 91A). Not every test appeared in
every MOS, and the columns of the omitted MOS are not marked.

Test                                       Coefficient Alpha

Perform CPR                                .31  .34  .38  .33  .38  .41  .55
Administer nerve antidote to self          .55  .39  .36
Prevent shock                              .22  .12  .31
Put on field dressing                      .34  .39  .39  .19  .15  .31  .16  .31
Administer nerve agent antidote to buddy   .58  .32
Load, reduce stoppage, clear M16           .56  .46  .47  .52  .51  .32  .43
Perform operator maintenance on M16        .31  .38  .39  .44  .22
Load, reduce, clear M60                    .30  .40  .47
Perform operator maintenance               .45  .45  .36
Determine azimuth using a compass          .81  .84  .74
Determine grid coordinates                 .23  .53  .67  .79  .74  .70  .74
Decontaminate skin                         .71  .42  .48  .47  .47
Put on M17 mask                            .60  .49  .44  .66  .49  .33
Put on protective clothing                 .56  .55  .31  .40  .31  .62  .39  .40
Maintain M17 mask                          .38  .53  .28
Challenge and Password                     .46  .48  .41
Know rights as POW                         .48  .45  .44
Noise, light, litter discipline            .38  .12  .07
Move under fire                            .59  .56
Identify armored vehicles                  .62  .54  .58  .75  .57  .58
Camouflage equipment                       .31  .31  .08
Camouflage self                            .06  .47  .48
Report information - SALUTE                .76  .84  .74
Operate vehicle in convoy                  .40  .36

Step/Scale  Analyses  for  Hands-On  Tests 

For each hands-on step, the number and percentage who scored GO and NO-GO
were determined. The Brogden-Clemans biserial was computed for hands-on steps
just as for knowledge test items; that is, the step was correlated with the
total task score minus that step. Recall that for the hands-on tests there
were 15 task scores.

Steps that had low or negative correlations with the total task score were
reviewed to identify situations where performance scored as NO-GO was in fact
prescribed by local practices, and was as correct at that site as doctrinally
prescribed procedures. Instructions to scorers and to soldiers were revised
as necessary to ensure consistent scoring.

Use of step difficulty data to revise hands-on tests was limited by a number
of considerations. First, a task test usually represents an integrated
procedure. Each individual step must typically be performed by someone in
order for the task to continue. Removal of a step from a scoresheet,
regardless of its psychometric properties, might only confuse or frustrate
the scorer. Second, removal of a step which the Soldier's Manual specifies as
a part of the job may result in deleting a doctrinal requirement and
undermining the acceptability of the hands-on measure to management.

Because of these considerations very few performance measures were dropped
from scoring on hands-on tests, regardless of their difficulty level.
However, under certain limited circumstances, exceptions were made. On a very
few hands-on tasks, the test steps represent a sample of performance from a
large domain (e.g., Identify Threat Aircraft, Recognize Armor Vehicles, Give
Visual Signals). Individual steps could be deleted without damaging task
coverage or test appearance.

Three types of reliability data were explored for hands-on task tests:
test-retest, interscorer reliability, and internal consistency. For reasons
discussed below, only the internal consistency data were systematically used
in test revision.

So that test-retest reliability could be computed, all soldiers in the Batch
A MOS were retested on a subset of the same tasks they had been tested on
initially. Due to scheduling and resource constraints, the interval between
first and second testing was only 2 to 7 days. Thus, memory of initial
testing was a probable contaminant of retesting performance. Soldiers were
aware that they would be retested and some were found to have trained to
improve their performance between the two testing sessions. Thus, training
was not consistent across soldiers, but varied partly as a function of
motivation and partly as a function of the extent to which special duties
restricted training opportunities. Scores improved on second testing for many
soldiers. On the other hand, some soldiers resented having to repeat the
test; some told the scorer that they were unfamiliar with the task, when in
fact they had scored very high on initial testing. Thus, retest scores were
contaminated widely and variably by motivational factors. Overall,
test-retest was of limited utility and was not collected for Batch B
soldiers.

The use of alternate forms of a test offers an approximation of test-retest
reliability. However, development and large-scale administration of alternate
forms in either hands-on or knowledge mode was beyond the resources of the
project.

An attempt was made in Batch A to acquire interscorer reliability estimates
by having a Project A staff member score the soldier at the same time the NCO
was scoring. Two factors limited the feasibility of this approach. First,
sufficient personnel were not available to monitor all eight stations within
an MOS for the length of time required to accumulate sufficient data. The
problem was exacerbated when, for whatever reason, it was necessary to test
two MOS simultaneously. Second, for some MOS, particularly those performed in
the radio-teletype rig for MOS 31C and in the tank for MOS 19E, it was
difficult or even impossible to have multiple scorers without interfering
with either the examinee or the primary scorer. Because of these factors,
insufficient interscorer reliability data were available to systematically
affect the process of revising task measures.


By process of elimination, the reliability measure of choice for the hands-on
test was an internal consistency estimate, using split-half. Table III.46
shows, for each MOS, the means, standard deviations, and split-half
reliability estimates of the hands-on components across revised task tests.
The mean task scores tend to fall between 40 and [?]0%, with a few very
difficult tasks and a few very easy tasks for most soldiers in each MOS. The
standard deviations for task tests are in many cases high relative to the
means. This is at least in part an artifact of the sequential nature of many
of the hands-on tests: If soldiers cannot perform early steps, the test stops
and remaining steps are failed. For many MOS, the overall split-half,
calculated using seven scores against eight scores (odd-even), is rather low,
but these may be underestimates since it is difficult to conceive of parallel
forms arranged from tests of such heterogeneous tasks.


Table III.46

Means, Standard Deviations, and Split-Half Reliabilities for
Hands-On Test Components for Nine MOS

                                                Mean    Standard    Split-Half
MOS                                       N     (%)     Deviation   Reliability^a

13B - Cannon Crewman                     146    64.5      14.0         .82
64C - Motor Transport Operator           149    72.9       9.1         .69
71L - Administrative Specialist          126    62.1       9.9         .66
95B - Military Police                    113    70.8       5.8         .30
11B - Infantryman                        162    56.1      12.3         .49
19E - Armor Crewman                      106    81.1      11.8         .56
31C - Single Channel Radio Operator      140    80.1      10.7         .44
63B - Light Wheel Vehicle Mechanic       126    79.8       8.7         .49
91A - Medical Specialist                 159    83.4      11.4         .35

a Calculated as 8-test score correlated with 7-test score, corrected to 15
tests.


Task Performance Rating Scales

Inspection of the rating data revealed level differences in the mean ratings
provided by two or more raters of the same soldier. Therefore, all raters'
responses were adjusted to eliminate these level differences. Additionally, a
small number of raters were identified as outliers, in that their ratings
were severely discrepant overall from the ratings of other raters on the same
soldiers; their rating data were excluded from the analyses. (The procedures
for adjusting the ratings for level effects and for identifying outliers are
described in more detail in the discussion of measures in Section 11.)

Means and standard deviations were computed on the adjusted ratings for each
7-point scale. Interrater reliabilities were computed as intraclass
correlations, and the estimates were adjusted so as to represent the
reliability of two supervisors per soldier and four (Batch A) or three (Batch
B) peer raters for each soldier. The adjustment was made because numbers of
raters varied for each soldier, and it seemed reasonable to expect that
ratings could be obtained from two supervisors and four peers during the
Concurrent Validation. However, further experience in Batch B data collection
suggested that three peers per soldier was more reasonable. These issues are
discussed more fully in the sections on rating scale results.

Summary statistics on the task performance rating scales across the 15 tasks
in each MOS are presented in Table III.47 (more detailed results can be found
in ARI Technical Report 717). The distributions for the rating scales are
surprisingly free of leniency and skewness, with means between 4.3 and 4.9 on
the 7-point scale and standard deviations largely between .55 and .75.

Reliabilities varied widely across the tasks. In MOS such as 71L
(Administrative Specialist) and 63B (Light Wheel Vehicle Mechanic) where
soldiers work in isolation from each other or with only one or two others,
few peer ratings were obtained on each soldier and reliabilities are
correspondingly lower. The mean number of peer ratings among 11B
(Infantryman) was much higher, and many of the soldiers being tested
comprised training cohorts that had been together since their earliest Army
training. Some tasks that soldiers rarely perform were also characterized by
lower numbers of ratings and lower reliabilities.

During the Batch A field tests, it was observed that supervisors and peers,
confronted with only the task title, might not have been entirely clear on
the scope of tasks they were rating. Low interrater reliability supported
this observation. Consequently, for the Batch B data collection for two MOS
(31C and 19E), the task statements were augmented with the brief descriptions
of the tasks that had been developed for the task clustering phase of
development. However, this modification did not appear to affect results from
these MOS and it was not given further trial.


Table III.47

Means, Standard Deviations, Number of Raters, and Interrater Reliabilities of
Supervisor and Peer Ratings Across 15 Tasks for Nine MOS

                                              Mean              Standard    Split-Half
MOS                                   Group   Raters    Mean    Deviation   Reliability

13B - Cannon Crewman                  Sup.     1.5      4.99      .72         .67
                                      Peer     2.5      4.85      .60         .87

64C - Motor Transport Operator        Sup.     1.8      4.35      .64         .69
                                      Peer     2.[?]    4.26      .58         .70

71L - Administrative Specialist       Sup.     1.0      4.97      .70         .75
                                      Peer     1.9      4.97      .64         .60

95B - Military Police                 Sup.     1.9      4.51      .49         .64
                                      Peer     3.4      4.53      .46         .82

11B - Infantryman                     Sup.     1.8      4.45      .59         .74
                                      Peer     3.0      4.50      .55         .77

19E - Armor Crewman                   Sup.     1.7      4.69      .62         .76
                                      Peer     3.0      4.71      .45         .67

31C - Single Channel Radio Operator   Sup.     1.7      4.68      .68         .81
                                      Peer     2.5      4.68      .58         .74

63B - Light Wheel Vehicle Mechanic    Sup.     1.8      4.72      .68         .76
                                      Peer     2.1      4.68      .63         .81


Job History Questionnaire

Job history responses were analyzed to determine whether task experience as
captured by the Job History Questionnaire is related to performance on the
task-specific criterion measures. The questionnaire asked soldiers to
estimate the recency and frequency of performance of each task. If sufficient
relationships were found, job history data would also be collected during the
Concurrent Validation. Because the Job History Questionnaire data analyses
were performed solely to inform the decision on whether to continue
collecting job history information, attention was focused on detailed
analyses of one Batch A MOS (13B) and three Batch B MOS (11B, 19E, and 63B).

Recency and frequency were summed with frequency reverse scored prior to
summing, so that high scores indicate greater recency and/or frequency of
task experience. For the Batch A MOS 13B (Cannon Crewman), this summated
experience score was significantly related, in the positive direction, to
test scores for 9 of the 15 hands-on tests, and for 9 of the 30 knowledge
tests. For six tasks, experience was significantly related to performance on
both knowledge and hands-on tests. These findings certainly support the
continued examination of job experience effects.
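The summated experience score can be illustrated with a short sketch. The response formats of the questionnaire are not given here, so the 1-to-5 ranges, the direction of coding, and the function names below are all assumptions made only for the example.

```python
def reverse_score(value, low=1, high=5):
    """Reverse a response on an assumed low-to-high scale (1 becomes 5, and so on)."""
    if not low <= value <= high:
        raise ValueError("response outside the assumed scale range")
    return low + high - value


def experience_score(recency, frequency, low=1, high=5):
    """Sum recency with reverse-scored frequency.

    Assumes recency is already coded so that high means more recent, and that
    frequency is coded so that low means more frequent, which is why only
    frequency is reversed before summing.
    """
    return recency + reverse_score(frequency, low, high)


# Hypothetical soldier: performed the task recently (4) and very often (1).
score = experience_score(recency=4, frequency=1)
```

Reversing one item before summing keeps both components pointing the same way, so that a higher total always means more task experience.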


For the three Batch B MOS, frequency and recency were treated separately. For
MOS 11B (Infantryman), recency and frequency or both correlated
significantly, and in the appropriate directions, for 7 of the 15 hands-on
tests, and for 15 of the 30 knowledge tests. For six tasks, one or both
experience indexes were related to both hands-on and knowledge performance.

For MOS 19E (Armor Crewman), experience indexes were related to only three
hands-on tests and two knowledge tests. For one 19E task, experience was
significantly related to both knowledge and hands-on scores. For MOS 63B
(Light Wheel Vehicle Mechanic), experience was significantly related to only
two hands-on tests and five knowledge tests, with none of the tasks having
significant relationships with experience measures for both types of tests.
For soldiers in these two MOS, experience differences appear to have less
influence on performance.


Intercorrelations  Among  Task  Performance  Measures 

For each of the nine MOS, performance on 15 tasks was assessed by four
methods: hands-on tests, knowledge tests, supervisor ratings, and peer
ratings. Thus, a 60 x 60 correlation matrix could be generated for each of
the MOS, as in a multimethod-multitrait matrix (where traits are tasks). For
purposes of simple examination each MOS matrix was collapsed, by averaging
correlations across tasks, to a 4 x 4 matrix (see Figure III.12). For each
pair of methods, the 15 correlations between the two methods on the same
tasks (heteromethod-monotrait) were averaged and entered above the diagonals.
The 210 correlations between each pair of methods on different tasks
(heteromethod-heterotrait) were averaged and entered below the diagonals.
Finally, the 105 correlations between pairs of tasks measured by the same
method (monomethod-heterotrait) were averaged and are shown in the diagonals
of each matrix.
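The collapsing step described above can be sketched as follows. The ordering of the 60 variables (all 15 tasks for the first method, then the second method, and so on) and the function name are assumptions made for illustration; the source does not say how the matrix was laid out.

```python
from itertools import combinations
from statistics import mean


def collapse_mtmm(corr, n_methods=4, n_tasks=15):
    """Average a (methods x tasks) correlation matrix into three summaries.

    corr is a square list of lists whose variables are assumed to be ordered
    method by method, with tasks in the same order within each method.
    """
    def idx(method, task):
        return method * n_tasks + task

    tasks = range(n_tasks)

    # Same method, different tasks: 105 correlations per method when n_tasks = 15.
    within = {
        m: mean(corr[idx(m, s)][idx(m, t)] for s, t in combinations(tasks, 2))
        for m in range(n_methods)
    }

    same_task, diff_task = {}, {}
    for a, b in combinations(range(n_methods), 2):
        # Different methods, same task: 15 correlations per method pair.
        same_task[(a, b)] = mean(corr[idx(a, t)][idx(b, t)] for t in tasks)
        # Different methods, different tasks: 15 * 14 = 210 per method pair.
        diff_task[(a, b)] = mean(
            corr[idx(a, s)][idx(b, t)] for s in tasks for t in tasks if s != t
        )
    return same_task, diff_task, within
```

The three dictionaries correspond to the entries above, below, and in the diagonals of the collapsed 4 x 4 matrix, respectively.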

In general, there are three considerations in examining a full
multimethod-multitrait matrix: (a) The heteromethod-monotrait correlations
(above the diagonals) are indications of convergent validity among the
methods, the extent to which the different methods measure the same trait
(here, the traits are proficiency on tasks). (b) These same validity
coefficients (above the diagonals) should be greater than the corresponding
values in the heteromethod-heterotrait triangle (below the diagonals), as an
indication that the method-trait relationships are not all a reflection of
some other unspecified factor. (c) The monomethod-heterotrait correlations
(in the diagonals) should be lower than the coefficients above the diagonals,
as evidence of discriminant validity; that is, the methods of measuring tasks
are not overshadowing differences among tasks.

Figure III.12. Average correlations between task measurement methods on same
tasks and different tasks for nine MOS.

Without exception, the average correlations are highest both between and
within peer and supervisor ratings, with method variance (different tasks) in
general higher than variance accounted for by tasks. For hands-on and
knowledge tests, the average of same-task correlations between the two
methods (above the diagonal) is higher than either of the single-method
different-task average correlations (in the diagonal), which in turn are
usually higher than the average correlation between the two methods on
different tasks (below the diagonal). The lower correlations between the task
tests and task ratings, even on the same tasks (above the diagonal), further
evidence the influence of method variance in the ratings.

The correlations among the methods obviously tend to be higher when results
are aggregated across tasks to the total score level (see Figure III.13).
Again, the correlations between the two ratings methods are highest, and
correlations between rating methods and test methods are in general lowest.
The exceptions are among MOS 95B (Military Police), where the
hands-on/knowledge correlation is particularly low, and MOS 11B
(Infantryman), where ratings and test results are correlated nearly as highly
as the two test methods.

Table III.48 shows overall correlations between hands-on and knowledge tests
for MOS grouped by occupational category. The categories used correspond to
the Aptitude Area composites identified by McLaughlin, Rossmeissl, Wise,
Brandt, and Wang (1984) based on which Armed Services Vocational Aptitude
Battery (ASVAB) tests were most predictive of future training performance
success for particular Army MOS. These composites were labeled as clerical,
operations, combat, and skilled technical. The correlations were clearly
lowest in the skilled technical category; otherwise, there were no major
differences between groupings.

To know whether these correlations are high or low, some frame of reference
is needed. Rumsey, Osborn, and Ford (1985) reviewed 15 comparisons between
hands-on and job knowledge tests. For 13 of the 19 comparisons using work
samples classified as "motor" because the majority of tasks involved physical
manipulation of things (see Asher & Sciarrino, 1974, for a distinction
between "motor" and "verbal" work samples), they found a mean correlation of
.42 prior to correction for attenuation and .54 following such correction.
Results were further divided into occupational categories based primarily on
which aptitude areas in the ASVAB test predicted performance for that
category, as follows: skilled technical, operations, combat arms, and
electronic. Table III.49 shows corrected and uncorrected correlations in each
of these categories. An additional category, clerical, was identified, but no
investigations using a motor work sample had reported any results in this
category.

Figure III.13. Reliabilities and correlations between task measurement
methods across tasks for nine MOS.


Table III.48

Correlations Between Hands-On and Knowledge Test Components for MOS
Classified by Type of Occupation

                                                         Correlation Between
                                            Total        Knowledge and Hands-On
Type of Occupation (MOS)                    Sample Size    r^a     Corrected r^b

Clerical                                       126         .52         .76
  (71L - Administrative Specialist)

Operations                                     393         .43         .71
  (63B - Light Wheel Vehicle Mechanic;
   64C - Motor Transport Operator;
   31C - Single Channel Radio Operator)

Combat                                         414         .46         .67
  (11B - Infantryman;
   13B - Cannon Crewman;
   19E - Armor Crewman)

Skilled Technical                              250         .17         .35
  (95B - Military Police;
   91A - Medical Specialist)

OVERALL                                      1,183         .39         .62

a Correlation between knowledge and hands-on test scores averaged across
samples.

b Correlation between knowledge and hands-on test scores averaged across
samples and corrected for attenuation.


Table III.49

Reported Correlations Between Hands-On (Motor) and Knowledge Tests

                           Correlation
                        r^a     Corrected^b

Operations              .45        .60
Combat Arms             .47        .62
Skilled Technical       .58        .67
Electronics             .27        .34

ALL                     .42        .54

a Correlation between knowledge and hands-on test scores averaged across
samples.

b Correlation between knowledge and hands-on test scores averaged across
samples and corrected for attenuation.
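The correction for attenuation referred to in these tables is conventionally the observed correlation divided by the square root of the product of the two reliabilities. The sketch below shows that standard formula only; which reliability estimates were actually entered for each sample is not stated here, so the function name and the example values are illustrative assumptions.

```python
from math import sqrt


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between two measures free of unreliability.

    r_xy is the observed correlation; r_xx and r_yy are the reliabilities of
    the two measures (for example, a knowledge test and a hands-on test).
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / sqrt(r_xx * r_yy)


# Illustrative values only, not taken from any sample in the tables.
corrected = correct_for_attenuation(0.40, 0.80, 0.50)
```

Because the divisor shrinks as reliability falls, the same observed correlation is corrected upward more when one of the measures, such as a short hands-on test, is less reliable.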


In general, the correlations observed in the Project A field tests were at a
level consistent with those found in the literature. They were particularly
consistent for three MOS, Motor Transport Operator, Infantryman, and
Administrative Specialist, that represented three separate occupational
groupings. They were low in two skilled technical occupations, Military
Police and Medical Specialist. This pattern in the skilled technical
groupings does not correspond to findings reported in the literature (Rumsey
et al., 1985). Since the Military Police and Medical Specialist occupations
were also the MOS for which qualifying scores on the Armed Forces
Qualification Test (AFQT) were higher than in any of the other Project A MOS,
there is some reason to believe that restriction in range may have been a
factor contributing to the lower correlations.

How reliable are the measures? The weighted average of the split-half
reliability estimates shown in Table III.44 for the 30 knowledge tests is
.80; this average does not substantially deviate from an average reliability
of .83 reported in the literature for job knowledge tests (Rumsey et al.,
1985).

The average of the split-half reliability estimates shown in Table III.46 for
the 15 hands-on tasks is .52. Ultimately, a 30-task test will be generated
for each MOS based on the 15 tasks for which both types of measures have been
developed and the 15 tasks for which only job knowledge tests have been
developed. Using the Spearman-Brown formula, it can be estimated that the
reliability of a 30-task hands-on test would have been .68, relative to an
average value of .71 reported in the literature (Rumsey et al., 1985). While
the estimates found here clearly were not higher than those previously
reported, it should be noted that the overall test development strategy in
Project A placed more emphasis on comprehensiveness than on content
homogeneity.
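The Spearman-Brown projection from 15 to 30 tasks can be written out as a worked equation. The notation is an assumption made for the illustration: r_{11} denotes the reliability of the 15-task test and k the factor by which the test is lengthened (here k = 30/15 = 2).

```latex
r_{kk} = \frac{k\, r_{11}}{1 + (k - 1)\, r_{11}},
\qquad
r_{30\text{-task}} = \frac{2(.52)}{1 + (2 - 1)(.52)} = \frac{1.04}{1.52} \approx .68
```

Doubling the length raises the estimate from .52 to .68 on the assumption that the added tasks behave like those already tested.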
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Revision  of  Task-Specific  Performance  Measures 


In revising the hands-on and knowledge tests, the goals for each MOS were a
reduction in knowledge test items of 25 to 40% (depending on the MOS), and a
set of between 14 and 17 hands-on task tests. Field test experience indicated
that reductions of this magnitude would meet the time allotments for
Concurrent Validation.

To make these reductions, both the field test results and additional
systematic judgments of the "suitability" of hands-on measurement were used.


Hands-On  and  Knowledge  Measures 

Developing an appropriate hands-on, or job sample, measure of a particular
job task is not always feasible, since it may be difficult to standardize
conditions or obtain the necessary equipment. Compromises may be necessary
and the question arises as to what effect the compromises have on content
validity. To assess such effects during the field tests, the project staff
involved in developing and administering the hands-on measures rated the
entire pool of hands-on tests according to their suitability for hands-on
measurement. The points considered were standardization of conditions,
reliability of scorers, and quality of task coverage.

The suitability ratings were then used with other data to further refine
the task test sets. First, if the hands-on set was too long (more than 17
tests, or likely to run over 3 hours) after revisions, the developers dropped
the hands-on tests that were judged least suitable for hands-on measurement,
or that were judged suitable but had very high correlations with the
corresponding job knowledge task test, or that had correlations with a
similar hands-on task test. However, if dropping such tests would not effect
a savings, because the tests were not time-consuming or resource intensive,
the tests were often retained. When the hands-on set comprised 14-17 of the
best available hands-on tests, the set was considered final.

If, after revisions, the knowledge test set had 60 to 70% as many items
as before, the tests were considered feasible for the 2-hour time slot. The
knowledge test was then accepted as complete and finalized for Proponent
review.

However, if there were too many items, the strengths and weaknesses of
each knowledge and hands-on test were analyzed by means of a procedure used
to assign "flaw" points to each test. The flaw point procedure considered
whether the test was or was not revised after the field test, test
difficulty, variability in scores, reliability, and hands-on suitability.
The points assigned were considered in conjunction with an analysis of the
specific content overlap between hands-on and knowledge tests. Knowledge
tests were reduced to items that tended not to overlap with hands-on tests by
considering first the more flawed knowledge tests, and then the knowledge
tests that were found to be redundant with hands-on tests. The steps in
reduction are described more completely in Appendix N in the ARI Research
Note in preparation.

The extent of the changes made on the tests, considering both obtained
data and informed judgments, was small. Among common task tests, judgments
of hands-on suitability resulted in deleting five tests (Recognize Armored
Vehicles, Visually Identify Threat Aircraft, Decontaminate Skin, Move Under
Direct Fire, Collect and Report Information). Additionally, in each MOS two
to five MOS-specific tasks were dropped as not suitable for hands-on
testing. For most MOS, the set of hands-on tests, including those field
tested in MOS and tests judged not suitable, comprised 19 to 23 tasks; after
suitability judgments were made, the hands-on sets were reduced to 15 to 19
tasks in each MOS. Appendix T in the ARI Research Note in preparation lists
the full set of hands-on tests that were developed for all MOS, and
indicates, for common tasks, the other MOS for which the tasks were selected
and where, therefore, they might also be tested hands-on.

By following the adjustment steps described, each MOS was covered by a
set of 15-17 hands-on tests and a set of knowledge items that was 60 to 70%
of the set field-tested. The array of hands-on and knowledge tests for each
MOS is summarized in Table III.50; a list of the tests offered for Proponent
agency review is presented in Appendix U of the ARI Research Note in
preparation.

Table III.50

Summary of Testing Mode Array for MOS Task Tests Before Proponent Review

       Hands-On    Hands-On and
       and         Reduced                              Total      Knowledge
MOS    Knowledge   Knowledge     Hands-On   Knowledge   Hands-On   Items

13B        8           9            0          13          17        177
64C        8           6            2          14          16        168
71L        5           5            5          15          15        148
95B       15           0            0          15          15        210
11B       10           4            1          16          15        198
19E       10           4            1          15          15        196
31C       15           0            0           0          15        215
63B       15           0            0           0          15        196
91A       15           0            0           0          15        234


Job Task Ratings


The high correlations among rating scales, relative to their
correlations with other methods, are consistent with previous literature. At
this point the question still remains as to whether the "method" variance
inherent in the ratings represents relevant or extraneous components of
performance. Interrater reliabilities are sufficiently high to warrant
retention of the rating scales for the Concurrent Validation.

Also, findings reported by Borman, White, and Gast (1985) using this
same field test data set reveal that, for some MOS, overall performance
ratings were more closely related to hands-on and job knowledge tests than
the task-based ratings examined here. This raises additional questions about
whether raters adequately understood and appropriately used the task-based
scales.


Job  History  Questionnaire 

The results from the Job History Questionnaire, although far from
conclusive, provided sufficient indication that job experience may be an
important factor to warrant further consideration of this variable. As a
consequence, the Job History Questionnaire is being retained in the
Concurrent Validation data collection. Those data, with much larger sample
sizes, will be used to identify which, if any, task measures should be
corrected for the contaminating effects of differential experience.
Furthermore, the relationship between experience and performance may vary as
a function of the aptitude being validated and the difficulty of the task.
Therefore, care will be taken regarding the possibility of interaction
effects as well as covariance effects.


Proponent  Agency  Review 

The final step in the development of hands-on and knowledge tests was
Proponent agency review. This step was consistent with the procedure of
obtaining input from Army subject matter experts at each major developmental
stage.

The  Proponent  was  asked  to  consider  two  questions:  (a)  Do  the  measures 
reflect  doctrine  accurately,  and  (b)  do  the  measures  cover  the  major  aspects 
of  the  job?  A  Proponent  representative  was  given  copies  of  the  measures; 
staffing  of  the  review  was  left  to  the  discretion  of  the  Proponent  agent. 

Item changes generally affected fewer than 10% of the items within an
MOS, and most such changes involved the wording, not the basic content, of
the item. Changes affecting the task list occurred in only three MOS.
Proponent comments and resulting actions taken are summarized below for each
of these MOS.


11B - Infantryman. The Infantry Center indicated that the primary
emphasis for Infantry should be nonmechanized. To that end, they advised
dropping three tasks: Perform PMCS on Tracked or Wheeled Vehicle, Drive
Tracked or Wheeled Vehicle, and Operate as a Station in a Radio Net. Two
tasks field tested in other MOS were substituted: Move Over, Through, or
Around Obstacles, and Identify Terrain Features on a Map. The Center also
concurred with the addition of a hands-on test of the task, Conduct
Surveillance Without Electronic Devices; the hands-on test of Estimate Range
was dropped in exchange.

71L - Administrative Specialist. The Soldier Support Center, Proponent
for MOS 71L, recommended that Post Regulations and Directives be eliminated
from the 71L task list. They also recommended that four tasks originally to
be tested only in the knowledge mode be tested in the hands-on mode as well:
File Documents/Correspondence, Type a Joint Message Form, Type a Military
Letter, and Receipt and Transfer Classified Documents. To allow testing
time for additions, the following tasks, originally to be tested in both the
hands-on and knowledge modes, will now be tested only in the knowledge mode:
Perform CPR; Put On/Wear Protective Clothing; Load, Reduce Stoppage, and
Clear M16A1 Rifle; and Determine Azimuth with Lensatic Compass. The 71L test
set was then composed of 29 tasks, 14 tested in a hands-on mode.

95B - Military Police. The Military Police School, Proponent for MOS
95B, indicated that the role of the military police was shifting toward a
more combat-ready, rear area security requirement, rather than the domestic
police role emphasized by the tasks selected for testing. They recommended
that five tasks be added. Three of these--Navigate from One Position on the
Ground to Another Position, Call for and Adjust Indirect Fire, and Estimate
Range--had previously been field tested with MOS 11B soldiers. Both hands-on
and knowledge tests for these tasks were added to 95B. Another, Use
Automated CEOI, had been field tested with MOS 19E soldiers; this task was
added to the list of knowledge tests only. The final task, Load, Reduce a
Stoppage, and Clear a Squad Automatic Weapon, not previously field tested,
was also added to the list of knowledge tests only. Four tasks were dropped.
Two--Perform a Wall Search, and Apply Hand Irons--had initially been proposed
for both hands-on and knowledge testing. The remaining two--Operate a
Vehicle in a Convoy, and Establish/Operate Roadblock/Checkpoint--had been on
the knowledge-only task list. The 95B test set then consisted of 31 tasks,
16 tested in a hands-on mode.

In determining whether any of these task list changes constituted a
major shift in content coverage, special consideration was given to the
principle applied in the initial task selection process that every cluster of
tasks be represented by at least one task. What impact did the Proponent
changes have with respect to this principle? For MOS 71L and MOS 95B, each
cluster was still represented after the Proponent changes had been
implemented. For MOS 11B, the deletion of Perform PMCS on Tracked or Wheeled
Vehicle and Drive Tracked or Wheeled Vehicle left one cluster, consisting of
tasks associated with vehicle operation and maintenance, unrepresented.
However, since it was the Infantry School's position that tasks in this
cluster did not represent the future orientation of the 11B MOS, this
omission was considered acceptable.

A second condition in which strict adherence to Proponent suggestions
was not necessarily advisable was where the suggestions could not be easily
reconciled with documented Army doctrine. Where conflict with documentation
emerged, the discrepancy was pointed out; if the conflict was not resolved,
items were deleted. Finally, if Proponent comments seemed to indicate a
misunderstanding of the intended purpose or content of the test items,
clarification was attempted. The basic approach was to continue discussions
until some mutually agreeable solution could be found.

Copies of all tests, reflecting revisions based on field test data,
adjustments to fit constraints of Concurrent Validation, and changes
recommended by Proponent agencies, are presented in Appendix V in the ARI
Research Note in preparation.


Summary  and  Discussion 

The results of the task-based MOS-specific development effort, from the
first perusals of the MOS task domains, through task selection, test
development, and field test data collection, to the Proponent review and the
final production of criterion measures for Concurrent Validation, are
satisfying at several levels. More than 200 knowledge tests and more than
100 hands-on tests were developed and field tested, and the field test
experience was applied to the production of criterion measures of more than
200 tasks for the nine MOS. The tests provide broad coverage of each MOS in
a manner that is both psychometrically sound and appealing to MOS
proponents.

Initial predictions of the capability of Army units to support hands-on
tests, and of the ability of SL1 soldiers to comprehend the knowledge tests
and rating scales, were largely borne out during data collection. Where
serious misadjustments had been made, it was possible to make corrections to
eliminate the problems encountered.

The several methodologies developed for defining the task domains,
obtaining SME judgments of task criticality and difficulty, selecting tasks
for testing, assigning tasks to test modes, and reducing test sets to
manageable arrays proved both comprehensive and flexible. The peculiarities
of each MOS required that the methods be adapted at various points, yet for
every MOS all vagaries were dealt with to the satisfaction of both developers
and proponents.

In general, means and standard deviations revealed a reasonable level of
performance variability on hands-on and knowledge tests. In one MOS where
the variability of hands-on tests was most limited, Military Police, there
have been Proponent-directed changes that may result in increased variability
in Concurrent Validation testing.

The developmental activities that have been described resulted in the
preparation of performance measures to be administered concurrently with
predictor measures in a large-scale validation. As this effort is completed,
a new set of task-based measures will be developed to measure performance of
soldiers in their second tour. It is anticipated that many of the procedures
used in developing first-tour measures will be appropriate for this new
purpose as well, although some revisions will be needed to accommodate the
expanded responsibilities associated with second-tour jobs. Work on
developing these revised procedures is already under way.


Section 11


FIELD TEST RESULTS: MOS-SPECIFIC RATINGS (BARS)1


This section presents the field test reliability data and scale
characteristics for the job (MOS)-specific rating scales that were developed
via the BARS method. Before those analyses were carried out, the ratings for
the MOS-specific BARS, as well as the other rating scale criterion measures,
were adjusted for certain between-rater differences. After describing these
adjustments and the results of the analyses, the section closes with a
discussion of the BARS modifications for use in the Concurrent Validation
study.


Rating  Scale  Adjustments 
Differences  In  Rater  Levels 

One problem with ratings is that although raters might agree on a
particular ratee's strengths and weaknesses across different performance
dimensions, differences between raters in the level of mean ratings sometimes
appear. Consequently, for the Project A ratings measures we decided to
compute adjusted scores that would reduce or eliminate the level differences
between scores provided by two or more raters for a single ratee. The
procedures developed to compute adjusted ratings or scores were as follows:

•  Consider the score provided for one enlistee across all
other raters. For example, Rater 1 gave Enlistee A a score
of 4.0 and Enlistee B a score of 5.0 on Dimension X.

•  For each enlistee, compute the mean rating across all other
raters. For example, if Raters 2, 3, and 4 evaluated
Enlistee A on Dimension X, we computed the mean rating for
Enlistee A across these three raters.

•  Compare the score for the target rater-enlistee pair with
the mean computed for the same enlistee across all other
raters. Use those values to compute a mean difference
score for the target rater-enlistee pair.

•  Repeat this procedure to compute a difference score for
each rater-enlistee combination on each performance
dimension.


1 The development of these tests is described in Section 4, Part III.
Section 11 is based primarily on an ARI Technical Report in preparation,
Development and Field Test of Behaviorally Anchored Rating Scales for Nine
MOS, by Toquam et al., and the supplementary ARI Research Note in
preparation, which contains the report appendixes.


•  Weight  each  difference  score  by  the  number  of  other  raters 
evaluating  each  enlistee. 

•  For  each  rater,  compute  a  weighted  average  difference  score 
for  each  performance  dimension. 

•  Finally, compute an average difference score across all
performance dimensions for the target rater. Use the
average difference score to adjust the ratings provided by
that rater.

These  procedures  were  used  to  compute  adjusted  scores  for  all  raters. 
Ratings  supplied  by  peers  and  supervisors  were  pooled  to  compute  adjusted 
scores. 
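
The adjustment steps listed above can be sketched in code. This is a minimal illustration under stated assumptions, not the project's actual program: the data layout (a dictionary keyed by rater, ratee, and dimension), the function names, and the choice to subtract the rater's average difference score from each rating are all assumptions made for the example; the text says only that the average difference score was used to adjust the ratings.

```python
from collections import defaultdict


def rater_level_adjustments(ratings):
    """Return each rater's average difference score.

    `ratings` maps (rater, ratee, dimension) -> score (hypothetical layout).
    """
    # Group scores by (ratee, dimension) so the "other raters" are easy to find.
    by_target = defaultdict(dict)
    for (rater, ratee, dim), score in ratings.items():
        by_target[(ratee, dim)][rater] = score

    # Accumulate difference scores per (rater, dimension), weighting each by
    # the number of other raters who evaluated the same enlistee.
    weighted_sum = defaultdict(float)
    weight_total = defaultdict(float)
    for (ratee, dim), scores in by_target.items():
        for rater, score in scores.items():
            others = [s for r, s in scores.items() if r != rater]
            if not others:
                continue  # no other rater, so no difference score
            diff = score - sum(others) / len(others)
            weighted_sum[(rater, dim)] += len(others) * diff
            weight_total[(rater, dim)] += len(others)

    # Weighted average per dimension, then a plain mean across dimensions.
    per_rater = defaultdict(list)
    for (rater, dim), total in weight_total.items():
        per_rater[rater].append(weighted_sum[(rater, dim)] / total)
    return {rater: sum(v) / len(v) for rater, v in per_rater.items()}


def adjust_ratings(ratings, adjustments):
    """Subtract each rater's average difference score (an assumed sign convention)."""
    return {key: score - adjustments.get(key[0], 0.0)
            for key, score in ratings.items()}
```

Because each rater is compared with the mean of the other raters rather than with a grand mean that includes the rater's own score, the resulting adjustments can be fairly large for small rater groups.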


Identification  of  Outliers 

The next step involved identifying ratings that appeared unrealistic.
Two criteria for identifying questionable raters were developed. First, the
correlation between performance dimension ratings for a target rater-enlistee
pair and the mean performance dimension ratings provided by all other raters
evaluating that enlistee was computed. If this correlation was -.2 or lower
for any ratee, all of that rater's ratings were deleted from the data set.
Second, any rater who generated an average difference score of 2.0 or greater
was deleted from the sample.

All ratings made by any rater whose adjusted scores exceeded one or both
of the above criteria were deleted. These ratings were omitted because a
negative correlation (-.2 or lower) or a relatively high adjustment score
(2.0 in absolute terms) indicated that this rater's data did not correspond
to information provided by other raters evaluating the same enlistee(s). The
goal for eliminating outliers was to be as conservative as possible by
deleting only the most extreme ratings. For each of the MOS by rater type
(supervisors or peers) combinations, the number of raters deleted ranged from
zero to seven. Across all MOS and rater types, data were eliminated for only
22 ratees.
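
The two screening criteria can be illustrated with a short sketch. The thresholds (-.2 and 2.0) come from the text; the function names, the input layout, and the decision to skip profiles with no variance are assumptions made for this example only.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns None when either profile has no variance."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)


def flag_outlier_raters(profiles, avg_diff, r_cut=-0.2, diff_cut=2.0):
    """Return the set of raters whose ratings would all be dropped.

    profiles: rater -> list of (own_dimension_ratings, others_mean_ratings),
              one pair per ratee (hypothetical layout).
    avg_diff: rater -> average difference score from the level adjustment.
    """
    flagged = set()
    # Criterion 1: correlation of -.2 or lower with the other raters for any ratee.
    for rater, pairs in profiles.items():
        for own, others_mean in pairs:
            r = pearson(own, others_mean)
            if r is not None and r <= r_cut:
                flagged.add(rater)
                break
    # Criterion 2: average difference score of 2.0 or more in absolute terms.
    for rater, diff in avg_diff.items():
        if abs(diff) >= diff_cut:
            flagged.add(rater)
    return flagged
```

A rater flagged on either criterion is removed entirely, which matches the text's statement that all of that rater's ratings were deleted.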

For  all  remaining  analyses,  we  analyzed  ratings  provided  by  supervisors 
separately  from  ratings  provided  by  peers,  using  the  adjusted  scores  computed 
for  each  rater. 


Differences Between Batch A and Batch B Data Sets

When the "raw" ratings were adjusted for level differences between
raters, using the procedure described above, some adjusted scores fell
outside the actual range of rating scale values. For example, the rating
scores for one performance dimension ranged from 0.49 to 7.17. In the
analyses conducted for Batch A MOS, the adjusted ratings were allowed to
exceed the actual scale point range. For Batch B MOS, adjusted scores were
modified by truncating scores that exceeded 7.0 or that fell below 1.0. Thus,
in the MOS data the ratings for Batch A exceed the range of 1 to 7, whereas
ratings for Batch B fall within this range.
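
The Batch B truncation amounts to clipping each adjusted score to the 1-to-7 scale. A minimal sketch, with a hypothetical function name:

```python
def truncate_to_scale(score, low=1.0, high=7.0):
    """Clip an adjusted rating to the scale range, as described for Batch B."""
    return max(low, min(high, score))


# The extremes quoted in the text (0.49 and 7.17) would be pulled back to the
# scale end points; scores already inside the range are left unchanged.
clipped = [truncate_to_scale(s) for s in (0.49, 4.5, 7.17)]
```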


Rater/Ratee  Ratio 

Assumptions concerning the number of raters evaluating each soldier
affect the resulting reliability estimate. Generally, the more raters
evaluating a soldier, the higher the estimate. For each group of ratings, the
ratio of the number of raters to the number of ratees was calculated. These
data are reported in Table III.51. For comparison purposes, the table
includes the ratios for rating data computed before and after the ratings
were adjusted and screened. Note that these ratios changed very little
following the screening process.

Table III.51

Ratio of Raters to Ratees, Before and After Screening, for Supervisor and
Peer Ratings on MOS-Specific BARS

                                        Supervisors          Peers
MOS                                   Before    After    Before    After

13B - Cannon Crewman                   1.47     1.47      2.88     2.52
64C - Motor Transport Operator         1.84     1.82      2.77     2.57
71L - Administrative Specialist        1.04     1.04      1.90     1.89
95B - Military Police                  1.84     1.88      3.67     3.39
11B - Infantryman                      1.81     1.81      2.99     2.99
19E - Armor Crewman                    1.68     1.68      2.95     2.95
31C - Radio Teletype Operator          1.73     1.73      2.49     2,130
63B - Light-Wheel Vehicle Mechanic     1.77     1.77      2.03     2.09
91A - Medical Specialist               1.59     1.59      3.10     3.10

For a majority of enlistees in each MOS, there were ratings from two
supervisors. For the Administrative Specialist, however, only one supervisor
rating for each enlistee was obtained. This reflects the job content for
most administrative specialists, which makes it difficult to obtain very many
supervisor or peer ratings. These specialists do tend to work alone and for
one boss only.

For peer ratings, the ratio of raters to ratees ranged from 1.89 for
Administrative Specialist (71L) to 3.39 for Military Police (95B) with a
median value of 2.57. We obtained at least two peer ratings for every
enlistee with the exception of Administrative Specialists. For enlistees in
four of the MOS--Military Police (95B), Infantryman (11B), Armor Crewman
(19E), and Medical Specialist (91A)--there were about three peer ratings for
each.


For the reliabilities presented below, estimates were adjusted so that
reliability computed for peer ratings provided for Batch A MOS samples can be
interpreted as the expected correlation between the mean ratings provided by
equivalent groups of four peers. For Batch B MOS peer ratings, however,
three rather than four peers were assumed. Interrater reliability estimates
computed for supervisors can be interpreted as assuming that all soldiers
were rated by two supervisors.


Descriptive  Statistics  for  MOS  BARS  Ratings 
Supervisor  and  Peer  Ratings 

Table III.52 presents the means, standard deviations, ranges, and
reliability (interrater agreement) of the rating scores on each of the
individual MOS BARS scales.

Supervisor and peer ratings yielded similar levels of reliability
estimates. Across all MOS, median reliability estimates for supervisor
ratings range from .53 for Infantryman (11B) to .66 for Medical Specialist
(91A) with a median value of .57. For peer ratings, median values range from
.43 for Armor Crewman (19E) to .65 for Military Police (95B) with a median
value of .55. The median values indicate that for single-item scales,
interrater reliability estimates are at a respectable level. The reliability
estimates reported were adjusted for the number of raters for each ratee.
Given equal numbers of supervisor and peer raters for each ratee, the data
indicate that the supervisor ratings would be somewhat more reliable than the
peer ratings.
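
The comparison on an equal-rater basis can be made concrete by stepping the pooled medians down to a single rater with the Spearman-Brown relation. This is an illustrative calculation, not one reported in the text. It assumes the supervisor median of .57 reflects two raters and the peer median of .55 reflects three or four raters; because the medians pool Batch A and Batch B MOS, the peer figure is bracketed rather than exact.

```latex
r_1 = \frac{r_k}{k - (k-1)\,r_k}

\text{Supervisors } (k=2):\quad r_1 = \frac{.57}{2 - .57} \approx .40

\text{Peers } (k=3):\quad r_1 = \frac{.55}{3 - 2(.55)} \approx .29
\qquad
\text{Peers } (k=4):\quad r_1 = \frac{.55}{4 - 3(.55)} \approx .23
```

Under these assumptions a single supervisor's rating is somewhat more reliable than a single peer's, which is the direction of the comparison drawn in the paragraph above.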

Supervisors and peers also provided similar information about the mean
level of performance. Across the nine MOS, peers provided slightly higher
grand mean values than supervisors in two MOS, Administrative Specialist
(71L) and Infantryman (11B). Supervisors provided slightly higher grand mean
values than peers in two MOS, Motor Transport Operator (64C) and Military
Police (95B). Mean ratings by the two groups were nearly identical for the
remaining MOS.

Scale Intercorrelations

A summary of the average scale intercorrelations for supervisors, for
peers, and between supervisors and peers is shown in Table III.53. Average
intercorrelations among performance dimension ratings for supervisors and
peers are similar. The greatest difference between mean intercorrelations
for supervisors and peers occurs for Military Police (95B), with the mean
value for supervisors at .39 and mean value for peers at .58.

Revision of the MOS-Specific BARS for Administration to
the Concurrent Validation Sample

Prior to the administration of the MOS-specific rating scales in the
Concurrent Validation study, the scales were submitted to a Proponent review
to verify that critical first-term job requirements were represented in the
performance scales. Revisions were made in the MOS-specific behaviorally
anchored rating scales, using results from the field test as well as input
supplied by the Proponent review committee.


Table III.52 (Continued). Means, Standard Deviations, Ranges, and
Reliability Estimates for MOS-Specific ... [table body not recoverable]


Table III.53

Average Intercorrelations of MOS-Specific BARS for Supervisors, for Peers,
and Between Supervisors and Peers

                                                            Between Supervisors
                                                               and Peers^a
                                       Within      Within   Scale in  Scale not
MOS                       Statistic  Supervisors   Peers    Common    in Common

13B Cannon Crewman        Mean          .46         .50       .39       .33
                          SD            .12         .07       .10       .09

64C Motor Transport       Mean          .48         .42       .43       .38
    Operator              SD            .12         .16       .10       .09

71L Administrative        Mean          .42         .36       .38       .37
    Specialist            SD            .14         .11       .11       .12

95B Military Police       Mean          .39         .58       .44       .41
                          SD            .15         .07       .08       .09

11B Infantryman           Mean          .42         .50       .41       .34
                          SD            .10         .08       .07       .08

19E Armor Crewman         Mean          .29         .35       .28       .25
                          SD            .11         .13       .10       .09

31C Radio Teletype        Mean          .53         .49       .43       .38
    Operator              SD            .05         .09       .13       .07

63B Light-Wheel           Mean          .53         .43       .43       .40
    Vehicle Maintenance   SD            .10         .13       .10       .10

91A Medical Specialist    Mean          .45         .53       .45       .38
                          SD            .08         .09       .10       .09

^a The first column is the average of supervisor x peer intercorrelation
when both are using the same rating scale dimension. The second column is
the average of all "off diagonal" intercorrelations; that is, when supervisor
and peer are rating the same person but not using the same rating scale.

Revisions  Based  on  Field  Test  Data 

For each MOS, the reliability estimates computed for performance
dimension ratings provided by supervisors were compared with estimates for
dimension ratings provided by peers to identify problem dimensions. (See
Table III.54 for a summary of the median reliability estimates as well as the
range of reliabilities for each MOS.)

For most MOS, there appears to be no consistent pattern when reliability
estimates computed for supervisor ratings are compared with those computed
for peer ratings. Within MOS 95B one performance dimension, Providing
Security, appeared to present problems for both rater groups. The interrater
reliability estimate computed separately for supervisors and peers is .39.
Therefore, the definition as well as the behavioral anchors for this
particular dimension were clarified.

For the remaining MOS-specific rating scales, performance dimensions
with low reliability estimates for supervisor or peer ratings were identified
and the rating scale definitions and anchors developed for these dimensions
were reviewed. Anchors and definitions were revised if it seemed
appropriate.

Table III.55 contains the adjusted and unadjusted grand mean values by
MOS and by rater type. Grand mean values computed using adjusted scores
correspond very highly with grand mean values computed using unadjusted
scores. Since very little leniency or central tendency error is exhibited in
Table III.55, no changes were made in the scales as the result of these data.


Revisions  Based  on  Proponent  Review 


Following the Batch B field test administration, the nine MOS-specific
behaviorally anchored rating scales were each submitted to a Proponent
committee for review. Proponent committee members, who were primarily
technical school subject matter experts from each MOS, studied the scales and
made suggestions for modifications. For most MOS, suggestions made by
committee members were minor wording changes. For example, they noted a
problem with one of the anchors in one Administrative Specialist (71L)
performance dimension, Keeping Records. The committee recommended deleting
one anchor from this dimension because it described job duties typically
required of second-term personnel only (i.e., Handle Suspense Dates).
Therefore, this anchor was omitted.


For another MOS, Radio Teletype Operator (31C), the Proponent review
committee noted that the job title had been changed, and the necessary
changes were made on all Concurrent Validation rating forms. The current
MOS-specific rating form for this MOS now reads "Single Channel Radio
Operator--31C."


Table III.55

Summary of Grand Mean Values for Unadjusted and Adjusted BARS Ratings by MOS^a

                                        Supervisors                 Peers
MOS                       Statistic  Unadjusted   Adjusted    Unadjusted   Adjusted

13B Cannon Crewman        Mean       4.39(1.13)  4.89(0.81)   4.89(0.84)  4.85(0.71)
                          Median     4.90        4.92         4.97        4.92

64C Motor Transport       Mean       4.92(1.02)  5.07(0.73)   4.66(0.83)  4.74(0.66)
    Operator              Median     5.00        4.85         4.78        4.38

71L Administrative        Mean       4.56(1.13)  4.52(0.94)   4.75(0.01)  4.72(0.64)
    Specialist            Median     4.57        4.52         4.79        4.73

95B Military Police       Mean       4.59(0.75)  4.47(0.63)   4.43(0.66)  4.43(0.60)
                          Median     4.59        4.58         4.41        4.47

11B Infantryman           Mean       4.39(0.91)  4.45(0.70)   4.86(0.70)  4.51(0.10)
                          Median     4.44        4.51         4.60        4.65

19E Armor Crewman         Mean       4.89(3.78)  4.75(0.03)   4.75(0.60)  4.76(0.56)
                          Median     4.91        4.79         4.84        4.83

31C Radio Teletype        Mean       4.46(0.93)  4.68(0.86)   4.88(0.86)  4.66(0.69)
    Operator              Median     4.57        4.80         4.87        4.64

63B Light-Wheel Vehicle   Mean       4.34(0.98)  4.48(0.07)   4.64(0.81)  4.47(0.73)
    Mechanic              Median     4.41        4.59         4.54        4.38

91A Medical Specialist    Mean       4.71(0.83)  4.71(0.79)   4.72(0.76)  4.71(0.72)
                          Median     1.70        4.67         4.70        4.63

^a Standard deviations are shown in parentheses.


For one MOS, Military Police (95B), the committee asked for more extensive
changes. Committee members noted that because critical incident workshops
were conducted only in CONUS locations, a few requirements of the Military
Police job were missing. Incumbents in this MOS are required to provide
combat and combat support functions. Therefore, four performance dimensions
describing these requirements were added to the Military Police MOS-specific
rating scales: Navigation (Dimension H); Avoiding Enemy Detection (Dimension
I); Use of Weapons and Other Equipment (Dimension J); and Courage and
Proficiency in Battle (Dimension K). Definitions and behavioral anchors for
these scales had been developed for the Infantryman (11B) performance
dimensions rating scales. Proponent committee members reviewed these
definitions and anchors and authorized including the same information in the
Military Police performance rating scales.


Project  Review  and  Revision 

Following the Batch B field test sessions, Project A staff members
reviewed the final set of rating scales. This group, the Criterion
Measurement Task Force, was composed of project personnel responsible for
developing task-oriented and behavior-oriented measures.

Most members of the Task Force had participated in administering
criterion measures during the Batch A and Batch B field tests. They reported
that some of the rating scales, the behaviorally anchored scales in
particular, required considerable reading time, and they felt that some
raters were not reading the scales thoroughly before making their ratings.
The panel recommended that the length of the behavioral anchors be reduced to
ensure that all raters would review the anchors thoroughly before using them
to evaluate incumbents.

The performance dimension definitions and scale anchors were modified
accordingly. The goal was to retain the specific job requirements and
depiction of ineffective, adequate, or effective performance in each anchor
while eliminating unnecessary information or lengthy descriptions. Figure
III.14 shows an example of the anchors for one performance dimension in the
Military Police (95B) rating scales, as they appeared for the Batch B
administration and as they appear for the Concurrent Validation study.

A complete description of the rating scales administered in the
Concurrent Validation study is given in the MOS appendixes in the ARI
Research Note in preparation.

Figure III.14. Sample Performance Rating Scale before and after modifications
for Military Police (95B) MOS-Specific BARS.


Section  12 


FIELD TEST RESULTS: ARMY-WIDE RATING MEASURES¹


Analyses of the field test data from the Army-wide rating measures
focused on (a) distributions of the ratings, (b) interrater reliabilities,
and (c) intercorrelations among the rating scale dimensions.

Prior to these analyses, the same rater adjustments and outlier analyses
were conducted as were done for the MOS-specific ratings (see Section 11). A
relatively small number of raters (9 supervisors and 46 peers out of the
total sample of 904 supervisors and 1,205 peers) were identified as
outliers. Because these raters' ratings were so severely discrepant from
other raters' ratings of the same target soldiers, their data were excluded
from further analyses.


Statistics  From  Field  Test 


Distributions  of  Ratings 

Table III.56 presents frequency distributions of ratings made on each of
the seven points on the 7-point Army-wide rating scales. Table III.57 then
depicts the means and standard deviations of selected composite ratings as
well as the Overall Effectiveness and NCO Potential scales. Taken together,
findings from the two tables suggest that raters did not succumb to excessive
leniency (overly high ratings) or restriction-in-range (rating everyone at
about the same level). The modal rating of 5 on a 7-point scale and means
generally between 4 and 5 seem reasonable in that we would expect the
first-term performer to be a little above average, because some percentage of
the poor performers will have already left the Army.
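As an illustration of how the modal rating and an approximate mean can be read off a percentage distribution, the following minimal sketch uses the supervisor percentages for Overall Effectiveness in MOS 11B from Table III.56. The function name `summarize_distribution` is hypothetical. Because the tabled percentages are rounded, a mean recovered this way is only approximate and need not reproduce the corresponding entry in Table III.57.

```python
def summarize_distribution(percents):
    """Return (modal scale point, approximate mean) for percentages
    reported over scale points 1..len(percents)."""
    points = range(1, len(percents) + 1)
    total = sum(percents)  # may differ slightly from 100 because of rounding
    mean = sum(p * w for p, w in zip(points, percents)) / total
    mode = max(points, key=lambda p: percents[p - 1])
    return mode, mean


# Supervisor percentages, Overall Effectiveness, MOS 11B (Table III.56).
sup_11b_overall = [2, 6, 17, 24, 30, 16, 4]
mode, mean = summarize_distribution(sup_11b_overall)
print(mode, round(mean, 2))
```

The same function can be applied to the peer percentages in any row of the table to compare the two rater sources.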


Interrater  Reliability 

Interrater reliability results appear in Table III.58. In general, the
levels of the reliabilities are encouraging. Intraclass correlations for the
composites of the Army-wide behavioral dimensions are almost uniformly in the
.80s (median = .84). Reliabilities of the individual behavioral scales are
lower (.51-.68, median = .58) but still respectable. The Overall
Effectiveness and NCO Potential reliabilities are likewise reasonably high
(.47-.82, median = .65). Regarding the Army-wide common task ratings,
interrater reliabilities for the dimension composites are satisfactory
(.55-.84, median = .71), but not as high as the behavioral dimension
composites. Individual common task scale interrater reliabilities are lower
(.33-.60, median = .44).
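For readers unfamiliar with intraclass correlations, the sketch below computes one common form, the one-way random-effects ICC, for a balanced toy data set. It is an assumption that this form resembles the one used in the field test analyses; the report does not give the computational formula, and the actual data had unequal numbers of raters per ratee. The function name `one_way_icc` and the ratings are hypothetical.

```python
def one_way_icc(ratings):
    """One-way random-effects intraclass correlations.

    ratings: list of per-ratee lists, each holding the same number k of
    ratings (balanced design, assumed here for simplicity).
    Returns (icc_single, icc_mean): the reliability of a single rater's
    rating and of the mean of k raters' ratings.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand = sum(sum(r) for r in ratings) / (n * k)
    ratee_means = [sum(r) / k for r in ratings]
    # Between-ratee and within-ratee mean squares from one-way ANOVA.
    ms_between = k * sum((m - grand) ** 2 for m in ratee_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for r, m in zip(ratings, ratee_means) for x in r
    ) / (n * (k - 1))
    icc_single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    icc_mean = (ms_between - ms_within) / ms_between
    return icc_single, icc_mean


# Hypothetical 7-point ratings: five ratees, two raters each.
toy = [[5, 6], [3, 4], [6, 6], [2, 4], [5, 4]]
icc_single, icc_mean = one_way_icc(toy)
print(round(icc_single, 2), round(icc_mean, 2))
```

The mean-rating coefficient is always at least as large as the single-rater coefficient when the latter is positive, which is why composites of several raters look more reliable than any one rater.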


¹ The development of the Army-wide rating measures was described in Section
5, Part III. Section 12 is primarily based on ARI Technical Report 716,
Development and Field Test of Army-Wide Rating Scales and the Rater
Orientation and Training Program, Elaine D. Pulakos and Walter C. Borman
(Eds.), and the supplementary ARI Research Note 87-22, which contains the
report appendixes.

Table III.56

Frequency Distributions (Percent) of Ratings Across the Seven Points of the
Army-Wide Measuresᵃ

                           Scale Points
MOS        1       2       3       4       5       6       7

Individual Army-wide Behavioral Dimensions
11B       4/2     3/6    13/31   20/23   25/31   18/18   11/8
13B       3/4     7/6    13/11   17/18   24/30   23/23   13/9
19E       2/2     8/7    13/15   22/24   28/30   18/15   10/6
31C       3/3     9/6    14/12   19/18   30/32   16/19    9/10
63B       2/2    10/7    16/13   20/21   27/31   15/17    9/9
64C       4/3     9/6    12/12   19/21   26/34   18/18   12/6
71L       2/1     6/4    14/12   17/25   26/30   20/19   15/8
91A       4/3     9/8    12/13   19/22   27/28   18/18   12/9
95B       2/2     6/6    15/16   25/26   29/30   16/17    8/4

Overall Effectiveness
11B       2/1     6/2    17/12   24/26   30/37   16/19    4/3
13B       2/3     6/5    16/8    25/15   24/36   17/26   10/8
19E       0/1     4/4    10/12   25/28   42/33   15/19    3/3
31C       1/1     6/4    15/8    25/21   36/37   12/21    4/8
63B       2/1     8/4    14/12   29/25   30/38   14/15    4/4
64C       3/2     9/5    18/7    18/16   30/42   17/24    6/3
71L       0/0     7/3    15/9    27/31   29/32   20/23    2/2
91A       2/1     5/4    16/13   24/26   27/35   21/17    5/4
95B       1/1     4/3    14/11   23/27   32/38   20/19    7/2

NCO Potential
11B       9/4    18/12   15/18   17/19   22/25   15/17    5/4
13B      11/17    5/7    10/8    18/12   27/32   17/21   10/13
19E       3/6    11/10   13/17   21/24   27/26   20/13    5/4
31C       7/5    15/6    14/10   16/18   27/25   15/27    5/9
63B       6/4    13/10   17/14   21/19   21/29   15/17    6/7
64C       8/8     6/10   13/13   21/20   27/30   16/15    8/4
71L       2/0     4/3    11/12   18/24   38/39   19/15   10/7
91A       8/7    14/10   14/15   15/20   22/25   18/16    9/8
95B       6/7     5/7    10/15   21/23   25/24   18/18   14/7

Individual Army-wide Common Task Dimensions
11B       2/1     6/4    13/10   18/20   27/29   20/23   15/13
13B       3/3     5/3     9/8    19/15   28/28   23/29   13/14
19E       1/1     3/3     9/10   19/22   33/28   22/24   13/13
31C       1/1     3/3     9/7    20/19   29/30   22/24   16/16
63B       2/2     5/3    12/12   20/20   30/29   21/23   10/12
64C       3/3     5/6    12/11   20/20   28/34   25/19    7/6
71L       2/3     6/6    12/15   20/19   25/31   26/23    8/4
91A       2/3     4/4    10/10   18/20   27/27   25/21   16/14
95B       0/2     2/4    11/12   26/25   31/32   20/20   10/4

ᵃ In each cell, the percentage for supervisors is on the left and the
percentage for peers is on the right. The scale values range from Poor (1) to
Excellent (7).

Table III.57

Means and Standard Deviations of Selected Army-Wide Rating Measuresᵃ

                                      MOS
                 11B     13B     19E     31C     63B     64C     71L     91A     95B

Average Army-Wide Behavioral Dimensions
  Supervisors    4.50    4.76    4.46    4.59    4.42    4.52    4.73    4.56    4.44
                 (.82)   (.90)   (.65)   (.88)   (.87)   (.91)   (.77)   (.95)   (.79)
  Peers          4.53    4.67    4.47    4.54    4.49    4.56    4.78    4.57    4.46
                 (.58)   (.75)   (.59)   (.76)   (.76)   (.71)   (.65)   (.81)   (.66)

Overall Effectiveness
  Supervisors    4.47    4.59    4.48    4.55    4.38    4.36    4.39    4.57    4.56
                (1.02)  (1.23)   (.94)  (1.08)  (1.14)  (1.22)  (1.16)  (1.18)  (1.10)
  Peers          4.62    4.85    4.67    4.71    4.52    4.75    4.73    4.63    4.64
                 (.76)   (.99)   (.76)   (.95)   (.95)   (.95)   (.93)   (.93)   (.91)

NCO Potential
  Supervisors    3.97    4.34    4.26    4.28    4.14    4.30    4.76    4.23    4.59
                (1.37)  (1.55)  (1.23)  (1.42)  (1.36)  (1.37)  (1.27)  (1.48)  (1.35)
  Peers          4.14    4.66    4.23    4.56    4.31    4.14    4.76    4.29    4.35
                (1.08)  (1.27)  (1.06)  (1.24)  (1.18)  (1.26)   (.93)  (1.27)  (1.13)

Average Army-Wide Common Task Dimensions
  Supervisors    4.87    4.97    5.02    5.07    4.87    4.53    4.53    4.91    4.70
                 (.66)   (.70)   (.55)   (.61)   (.65)   (.63)   (.81)   (.71)   (.53)
  Peers          4.96    4.99    4.93    5.12    4.84    4.54    4.75    4.95    4.63
                 (.61)   (.68)   (.47)   (.61)   (.77)   (.56)   (.69)   (.68)   (.57)

ᵃ The mean is based on a 7-point scale ranging from Poor (1) to Excellent (7).
The standard deviation is shown in parentheses. The means, standard
deviations, interrater reliabilities, and intercorrelations appear in
Appendix E of ARI Research Note 87-22 for each individual Army-wide
behavioral dimension, and in Appendix F for each individual Army-wide common
task dimension.

Table III.58

Intraclass Correlation Coefficients for Selected Army-Wide Rating Measures

                                      MOS
                 11B     13B     19E     31C     63B     64C     71Lᵃ    91A     95B

ICCs for Average Behavioral Dimensions
  Supervisors    .82     .81     .86     .83     .84     .84     --      .81     .85
  Peers          .80     .83     .78     .86     .84     .85     .82     .86     .88

Mean ICCs Across Individual Behavioral Dimensions
  Supervisors    .58     .58     .46     .60     .60     .58     --      .60     .63
  Peers          .55     .61     .55     .60     .57     .58     .51     .67     .68

ICCs for Overall Effectiveness
  Supervisors    .64     .62     .54     .70     .63     .72     --      .74     .82
  Peers          .47     .60     .48     .65     .71     .66     .70     .68     .79

ICCs for NCO Potential
  Supervisors    .74     .61     .53     .71     .63     .68     --      .64     .68
  Peers          .57     .63     .59     .74     .66     .69     .60     .69     .68

ICCs for Average Common Tasks
  Supervisors    .77     .70     .74     .55     .55     .60     --      .71     .74
  Peers          .78     .72     .67     .64     .84     .65     .57     .79     .82

Mean ICCs Across Individual Common Tasks
  Supervisors    .42     .48     .38     .38     .42             --      .46     .41
  Peers          .51     .47     .46     .41     .51     .34     .33     .60     .57

ᵃ ICCs were not computed for 71L supervisor raters because almost all of the
ratees were evaluated by only one supervisor.

Supervisor and peer ratings have very similar levels of interrater
reliability. Median reliabilities were computed for supervisors and peers
separately, first for all of the behavioral dimension entries in Table III.58
and then for the common task scale reliability values. The peer ratings are
slightly higher in average reliability than those of the supervisors
(supervisors: BARS median = .66, task median = .55; peers: BARS median =
.68, task median = .59).

It should be noted that the data in Table III.58 are intraclass
correlation coefficients (ICC) representing the reliabilities of mean ratings
across supervisors or peers and, accordingly, are dependent on the average
number of raters per ratee. Just as adding items to a test increases its
reliability, larger rater/ratee ratios yield higher reliabilities as a
function of the Spearman-Brown formula. Considering the present rater/ratee
ratios (about 2.8 for peers vs. 1.8 for supervisors), the supervisor ratings
would have been somewhat more reliable than peer ratings if each source had
had the same number of raters per ratee.
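The Spearman-Brown argument can be made concrete with a short sketch. It steps the median BARS reliabilities reported in this section (.66 for supervisors, .68 for peers) down to a single rater, using the approximate rater/ratee ratios of 1.8 and 2.8. Treating an average, fractional number of raters as if it were a fixed number is a simplifying assumption made only for illustration, and the function names are hypothetical.

```python
def spearman_brown(r, factor):
    """Reliability after changing the number of raters (or items) by
    `factor`, given reliability r at the current length."""
    return factor * r / (1 + (factor - 1) * r)


def single_rater_reliability(r_mean, k):
    """Step a mean-rating reliability based on k raters down to one rater."""
    return spearman_brown(r_mean, 1 / k)


# Median BARS reliabilities and approximate raters per ratee from the text.
sup_single = single_rater_reliability(0.66, 1.8)
peer_single = single_rater_reliability(0.68, 2.8)

# Reliability each source would show with the same number of raters (two).
sup_two = spearman_brown(sup_single, 2)
peer_two = spearman_brown(peer_single, 2)
print(round(sup_single, 2), round(peer_single, 2))
print(round(sup_two, 2), round(peer_two, 2))
```

Under these assumptions the single-rater estimate is higher for supervisors than for peers, which agrees with the statement that supervisor ratings would have been somewhat more reliable at equal numbers of raters.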

However, the coefficients appearing in the table provide the appropriate
reliability estimates (of the mean supervisor and mean peer ratings), because
correlations between the rating data and other variables were calculated
using the mean supervisor and mean peer rating for each ratee. That is,
ratings of a given ratee were averaged across supervisors and across peers,
and all of the correlations reported here were computed on these means.
The sample size for each correlation is the number of ratees on which it was
calculated.
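The averaging step described above can be sketched as follows. Ratings are first averaged within ratee, and the correlation with another variable is then computed across ratees, so that the sample size is the number of ratees rather than the number of ratings. All ratee labels, ratings, and scores below are hypothetical.

```python
from collections import defaultdict
from math import sqrt


def mean_rating_per_ratee(records):
    """records: iterable of (ratee_id, rating). Returns {ratee_id: mean}."""
    sums = defaultdict(lambda: [0.0, 0])
    for ratee, rating in records:
        sums[ratee][0] += rating
        sums[ratee][1] += 1
    return {ratee: total / count for ratee, (total, count) in sums.items()}


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical supervisor ratings; ratees have unequal numbers of raters.
supervisor_records = [("A", 5), ("A", 6), ("B", 3), ("C", 4), ("C", 5), ("C", 6)]
# Hypothetical scores on some other variable, one per ratee.
other_scores = {"A": 62, "B": 48, "C": 55}

means = mean_rating_per_ratee(supervisor_records)
ratees = sorted(means)
r = pearson([means[x] for x in ratees], [other_scores[x] for x in ratees])
n_for_r = len(ratees)  # sample size is the number of ratees, not ratings
print(n_for_r, round(r, 2))
```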


Rating Scale Intercorrelations

The intercorrelations and the cross-correlations of the individual
scales for the Army-wide ratings (supervisor) and the MOS-specific BARS
ratings (supervisor) are shown in Table III.59 for all MOS. The average of
the within-measure scale correlations and the average of the scale
cross-correlations are also shown.

Overall, the scale intercorrelations are perhaps not as high as are
usually found for rating scale intercorrelations and they are certainly lower
than the individual scale reliabilities. This is particularly significant
because the scale reliabilities (i.e., the intraclass r) incorporate rater
differences as error while the scale intercorrelations do not (i.e., all
correlations are based on the same set of raters).

In general, the correlations between scales taken from the different
measures (the cross-correlations) are slightly lower than the within-measure
scale intercorrelations. However, the differences are not as great as one
might expect, given the different objectives of the two measures.


Revision  of  the  Army-Wide  Scales 

As with the MOS-specific BARS scales, experience administering the
Army-wide rating scales during Batch A indicated that some soldiers had
difficulty with the amount of reading required. It thus seemed prudent here
also to reduce the length of the behavioral anchors on the Army-wide
behavior-based scales. This was accomplished by editing each behavioral
statement to remove unnecessary language and reduce the reading difficulty.

Table III.59 (Continued)

Intercorrelation Matrixes for MOS-Specific BARS and Army-Wide BARS by MOS
6. Radio Teletype Operator (31C)
Light-Wheel Vehicle Mechanic (63B)

[Correlation matrix entries are not legible in the source.]

Note: All ratings shown are supervisor ratings. Decimal points have been
omitted in correlations.

In addition, it was felt that a few of the statements anchoring the
different effectiveness levels were multidimensional. That is, the example
behaviors contained in certain individual anchors were sufficiently different
to cause raters potential confusion regarding the level at which a ratee
should be evaluated. This potential problem was addressed by extrapolating
more global performance information from the specific behaviors and writing
the scale anchors to reflect these more general performance levels. The
changes were similar to those illustrated for the MOS-specific BARS (see
Section 11).

Another revision between the Batch A and Batch B administrations was to
drop 1 of the 13 common task scales. This was done simply because a 13th
scale would have required an additional page on the printed version of the
scales. The task dimension that had the lowest interrater reliability and
seemed the most redundant with others was eliminated for Batch B and the
Concurrent Validation effort. The final version of these scales, as well as
the Army-wide BARS, is shown in Appendix C of ARI Research Note 87-22.

Finally, after the Batch B administration, the instruments were
submitted to Proponent review. In this review, technical school subject
matter experts studied the scales and made suggestions for minor wording
changes on some of the anchors. Also, the dimension Maintaining Living/Work
Areas was dropped to reduce the length of time required to complete the
behavioral rating scales. Proponent review experts judged that dimension to
be the least important and the most expendable.

In  summary,  only  minimal  changes  were  made  to  the  Army-wide  rating 
scales  as  a  result  of  the  field  tests:  first,  eliminating  one  behavioral 
dimension  and  one  common  task  dimension  to  improve  administrative  efficiency; 
second,  making  relatively  minor  wording  changes  and  reducing  the  length  of 
the  scale  anchors  to  lessen  the  reading  difficulty  as  well  as  the  time 
required  to  complete  the  scales. 


Summary  and  Conclusions 

Results of the field tests for the Army-wide measures are very
encouraging. In particular: (a) rater participants seemed reasonably
accepting of the rating program and appeared able to understand and comply
with the instructions; (b) rating distributions were acceptable, with means a
little above the scale midpoints and standard deviations comparable to those
found in other research; and (c) interrater reliabilities were acceptably
high for both supervisor and peer raters.

Although results from both the Batch A and Batch B field tests were on the
whole positive, valuable information for improving the Army-wide scales was
gleaned from these trial administrations. To obtain the best possible
program, we requested that each Batch B rating session administrator provide
written feedback on his/her experiences, outlining any suggestions for
possible program improvement. While no major changes were required, several
suggestions were made to facilitate program administration for Concurrent
Validation and to prevent errors in completing the rating forms.

Because of the importance of rater orientation and training to the use
of ratings as criteria, an experiment was conducted on certain parameters.
Section 13 reports on that experiment.


Section  13 

RATER ORIENTATION AND TRAINING¹


The rater orientation and training program was seen as very important
for reaching the objective of obtaining high-quality ratings. Recent reviews
of research on rater training conclude that training is likely to improve
performance appraisals (Landy & Farr, 1980; Zedeck & Cascio, 1982). Studies
have shown that rating errors such as halo and leniency can be reduced by
appropriate training (Borman, 1975, 1979; Brown, 1968; Latham, Wexley, &
Purcell, 1975). Also, the accuracy of performance ratings has been enhanced
using rater training programs (McIntyre, Smith, & Hassett, 1984; Pulakos,
1984, 1986).

Project staff experience with the training of raters suggested that even
brief rater training sessions can result in ratings with reasonably good
psychometric characteristics. For example, in research that employed 5-15
minutes of rater training, mean ratings have been between 5 and 6 on a
9-point scale, with standard deviations between 1.25 and 2.00. Interpretable
factor analyses have resulted, suggesting that halo was not overly severe,
and interrater reliabilities have been in the .55-.85 range (e.g., Borman,
Rosse, Abrahams, & Toquam, 1979; Hough, 1984a; Peterson & Houston, 1980).

As a starting point for Project A, a rater training program that staff
members had developed and revised over the past several years was adapted for
use in this project.


Components of the initial rater orientation and training program were as
follows:

1. Rater selection guidelines were prepared. Where feasible, supervisors
   and four peers were identified for each first-tour soldier ratee. To
   be eligible to rate a soldier, the supervisor or peer must be familiar
   with the ratee's performance and have supervised or worked with the
   ratee for at least 2 months.


^This section is based primarily on ARI Technical Report 716, Development
and Field Test of Army-Wide Rating Scales and the Rater Orientation and
Training Program, Elaine D. Pulakos and Walter C. Borman (Eds.), and the
supplementary ARI Research Note 87-22, which contains the report appendixes.
2.  A briefing was prepared to acquaint participant raters with the
main objectives of Project A and to explain where the performance
ratings fit into the project.

3.  An orientation to the behavior-based rating scales was developed.
The principle of matching observed ratee performance with
performance described in the scales' behavioral anchors was
carefully explained and illustrated with several hypothetical
examples.

4.  A short program was aimed at avoiding three common rating errors:
halo, stereotyping, and paying too much attention to one or two
events relevant to the ratee's performance (hereafter labeled as
the "one incident of performance" error).

5.  For  practice,  peer  raters  were  asked  to  make  self-ratings  using  the 
Army-wide  behavior-based  rating  scales,  to  ensure  that  they  became 
acquainted  with  the  rating  process  before  they  began  their 
evaluations  of  other  soldiers. 

The orientation and training program described above was developed for
the Batch A field tests. The intent was to start with this program, evaluate
its effectiveness in the Batch A tests, revise for Batch B based on Batch A
experience, continue the tryout in Batch B, and finally revise as required
for the large-scale Concurrent Validation effort.


Lessons  Learned  During  the  Batch  A  Field  Tests 

The Batch A rater training and orientation program seemed quite
successful in that: (a) it appeared to flow well and be acceptable to both
supervisor and peer raters (i.e., the soldiers were generally attentive to
the program and appeared to complete their ratings responsibly); (b) the
interrater reliabilities were very reasonable, especially in light of the
fact that the peer raters were inexperienced at evaluating performance; (c)
the rating distributions were very reasonable, with no drastic skewing; (d)
the training effects did not seem to be trainer-bound in that at least seven
trainers administered the program at one time or another during the Batch A
field tests; and (e) the relationships between the ratings and other
criterion variables showed some predictable patterns (e.g., correlations
between MOS task scale ratings and hands-on test scores averaged about .25).

Although the program seemed effective, Batch A field test experience
suggested some additions and modifications. First, some supervisor and peer
raters did appear to be evaluating all of their ratees at approximately the
same level of effectiveness on many of the dimensions (i.e., halo error).
To counteract this tendency, raters were subsequently encouraged not only to
tell us about each individual's strengths and weaknesses (thereby avoiding
halo error) but also to indicate differences between soldiers who perform
well in a particular rating category and those who perform less well in the
category.

Second, although error reduction training is very important in yielding
high-quality evaluations, recent research (McIntyre et al., 1984; Pulakos,
1984) has suggested that error training alone may be insufficient for
increasing rating accuracy, the crucial criterion for evaluating performance
rating quality (Ilgen & Feldman, 1983; Landy & Farr, 1980). Therefore,
following the Batch A field tests we incorporated a more comprehensive
accuracy training component into the program. We stressed that, although we
did not want raters to make rating errors, the most important element was to
rate each of their subordinates or co-workers accurately. Thus, if raters
felt that their ratees actually performed at the same effectiveness level in
a given performance category, or that a particular soldier performed at
approximately the same level across several categories, then they were
encouraged to rate those individuals in this way. However, it was emphasized
that when real differences exist, the ratings should reflect these
differences.

Finally, a question was raised as to the usefulness of self-ratings as
an aid in familiarizing raters with the rating scales. Since less time was
to be available for ratings during Concurrent Validation, it was important to
consider which instruments and/or aspects of training might be
eliminated. Toward this end, an empirical evaluation of the self-rating
effects would be useful. No research to date had investigated the effects of
self-ratings on subsequent ratings of others, so evaluating this aspect of
the training had general as well as specific project implications. An
experiment, described below, was designed to investigate the self-rating
effect and was conducted as part of the Batch B field tests.


Batch  B  Rater  Training  Experiment 

Two training treatments were evaluated for peers as raters: (a) rater
orientation and error reduction training, including a brief refresher of the
error training points prior to administering each new scale; and (b) this
same program plus a self-rating warm-up for each scale.

Parallel training treatments were also evaluated for supervisors.
However, because the rating scales were specifically developed for evaluating
first-term soldier performance, having the supervisors use these scales to
perform a self-rating task would have been inappropriate. Consequently,
practice for the supervisors entailed rating a description of one
hypothetical soldier prior to evaluating their subordinates. The two
supervisor training treatments were thus: (a) rater orientation and error
reduction training, including brief refresher training before each new
instrument; and (b) this same program plus practice rating of one
hypothetical soldier on the Army-wide BARS.

The training treatments for each peer and supervisor rater group were
evaluated in terms of their effects on rating accuracy and three rating
errors (halo, leniency/severity, and restriction of range).


Subjects 

A total of 817 peer raters and 660 supervisor raters participated in the
Batch B field tests. Each soldier represented one of the following five MOS:
11B (Infantryman), 19E (Armor Crewman), 31C (Single Channel Radio Operator),
63B (Light Wheel Vehicle Mechanic), and 91A (Medical Specialist). Data were
collected from four CONUS locations and USAREUR.


Rating  Instruments 

Four of the rating instruments used during the Batch B field tests were
relevant for the present study:


Army-wide behavioral rating scales
Army-wide common task scales
MOS-specific behavioral rating scales
MOS-specific task scales


Experimental  Treatments 


Training Methods. The following three training methods were used as
independent variables:


Rater Orientation and Error Training Only. Peer and supervisor
raters assigned to this experimental condition received training
that can be characterized as a combination psychometric error and
frame-of-reference program (Bernardin & Pence, 1981; Pulakos,
1984). Briefly, one component of training involved carefully
explaining the logic of the behavior-based and task rating scales,
as well as urging raters to study and properly use the instruments
to arrive at their evaluations. The second major component
involved descriptions of halo, stereotyping, one incident of
performance, and same-level-of-effectiveness errors in lay terms and
provided guidance on how to avoid these errors.


Rater Orientation and Error Training Plus Practice: Peer Raters.
This experimental condition consisted of the training outlined
above plus practice using the rating scales in the form of
self-appraisals. Specifically, prior to rating their co-workers on
each of the four sets of scales, peer raters were asked to evaluate
themselves using these instruments.


Rater Orientation and Error Training Plus Practice: Supervisor
Raters. Supervisors assigned to this condition also received the
rater orientation and error reduction training discussed above.
However, their practice entailed evaluating one hypothetical ratee
on the Army-wide behavioral performance dimensions. A vignette
describing performance of a first-term soldier was developed for
this purpose, using behavioral examples from the pool of items
retranslated during Army-wide behavior scale development.


Dependent Variables. We were able to create "ratees" with known
performance scores by developing vignettes about first-term soldiers performing
their jobs, using in the vignettes previously scaled behavioral examples
(just as we did for the supervisor practice rating condition). The true or
target performance level for a dimension was simply the mean retranslation
effectiveness level for the example included in the vignette for that
dimension. Four vignettes were written describing performance in the
following Army-wide areas: (a) Effort, (b) Maintain Assigned Equipment, (c)
Maintain Living and Work Areas, (d) Physical Fitness, (e) Self-Development,
and (f) Self-Control.

By using expert judges' estimates of the true intercorrelation between
the six dimensions, along with dimension means of 4.0 and standard deviations
of 1.5, a true score matrix (see Table III.60) containing scores for
hypothetical ratees on each dimension was generated; this matrix possessed the
"correct" covariance structure. Using behavioral examples obtained from the
retranslation phase of the Army-wide behavior scaling process, vignettes were
then written describing four ratees performing at the effectiveness levels
shown in Table III.60. Each incident had been allocated reliably into a
single dimension and assigned a narrow range of effectiveness levels in the
retranslation process.


Table III.60

True Score^a Matrix for Vignette Ratees on Six Army-Wide Dimensions


                                          Ratees
Dimensions                           1     2     3     4

1. Effort                            5     2     6     4
2. Maintain Assigned Equipment       5     0     5     2
3. Maintain Living & Work Areas      3     1     5     3
4. Physical Fitness                  4     0     5     6
5. Self-Development                  7     2     6     4
6. Self-Control                      6     1     2     5

^aBecause the rating task required evaluators to select a whole number from
1 to 7 describing each soldier's effectiveness on a dimension, the generated
true scores were rounded to the nearest whole number.
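
One way a true score matrix with a specified covariance structure could be
generated is sketched below. This is an illustration only, not the procedure
the project staff actually used: the sampling approach, the uniform
intercorrelation of .30, and the names cholesky, generate_true_scores, and
HYPOTHETICAL_CORR are all assumptions. Only the dimension mean (4.0), the
standard deviation (1.5), and the rounding to whole numbers come from the
text.

```python
import math
import random


def cholesky(matrix):
    """Return lower-triangular L such that L * L^T equals a positive-definite matrix."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                lower[i][j] = math.sqrt(matrix[i][i] - s)
            else:
                lower[i][j] = (matrix[i][j] - s) / lower[j][j]
    return lower


def generate_true_scores(corr, n_ratees, mean=4.0, sd=1.5, seed=0):
    """Draw correlated scores for each ratee and round them to whole numbers."""
    rng = random.Random(seed)
    lower = cholesky(corr)
    n_dims = len(corr)
    scores = []
    for _ in range(n_ratees):
        # Independent standard normal draws, one per dimension.
        z = [rng.gauss(0.0, 1.0) for _ in range(n_dims)]
        # Multiplying by L imposes the target correlation structure.
        correlated = [
            sum(lower[i][k] * z[k] for k in range(i + 1)) for i in range(n_dims)
        ]
        scores.append([round(mean + sd * c) for c in correlated])
    return scores


# Hypothetical stand-in for the expert judges' estimates (not from the report).
HYPOTHETICAL_CORR = [[1.0 if i == j else 0.3 for j in range(6)] for i in range(6)]

true_scores = generate_true_scores(HYPOTHETICAL_CORR, n_ratees=4)
```

With only four ratees, a random draw will reproduce the target covariance
structure only roughly, so a constructed (rather than sampled) matrix may be
closer to what the text describes.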


After  evaluating  their  co-workers  or  subordinates  on  the  Army-wide 
behavioral  rating  scales,  both  peers  and  supervisors  read  and  then  rated  each 
of  the  four  vignettes.  The  materials  used  to  collect  these  data,  including 
instructions,  the  actual  vignettes,  and  special  rating  scales  containing  only 
the  six  dimensions  relevant  to  the  vignettes,  are  presented  in  ARI  Research 
Note  87-22. 


Using the peer and supervisor ratings of the soldiers evaluated (not the
vignettes), the following four rating indexes were computed: interrater
agreement, halo, leniency/severity, and restriction of range. The vignette
ratings were used to assess training effects on accuracy. Each dependent
measure was examined separately for peer and supervisory raters.


Procedure 

In the field tests first-term soldiers reported to their rating sessions
in groups of approximately 15. At each location, one supervisor rating
session was conducted for each MOS. Thus, at each post it was necessary to
assign supervisor raters within an MOS to one of the two training treatments
and then counterbalance the treatment for each MOS across the posts. So, for
example, at Fort Stewart, MOS 19Es and 91As received error training only,
while MOS 11Bs, 31Cs, and 63Bs received error training plus practice.
Conversely, at Fort Lewis, MOS 11Bs, 31Cs, and 63Bs received error training
only, while MOS 19Es and 91As received error training plus practice. A
similar counterbalancing scheme was used for the remaining three locations.
This assignment process resulted in approximately equal numbers of soldiers
from each MOS receiving each type of training across the five data collection
sites.


Results 

Interrater  Agreement 

Within each training treatment, an intraclass correlation was computed
for each dimension of the four instruments on which actual soldier
performance was rated. Within each instrument, these correlations were then
averaged across the dimensions, resulting in four indexes of rater agreement
for each training condition.
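
The computation described above can be sketched as follows. The report does
not state which form of the intraclass correlation was used; this sketch
assumes the one-way random-effects, single-rater form with the same number of
raters for every ratee. The function names and the sample ratings are
hypothetical.

```python
def icc_one_rater(ratings):
    """One-way random-effects, single-rater intraclass correlation.

    ratings: one list per ratee, each holding the same number k of ratings.
    """
    n = len(ratings)
    k = len(ratings[0])
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    ratee_means = [sum(row) / k for row in ratings]
    # Between-ratee and within-ratee mean squares from a one-way ANOVA.
    ms_between = k * sum((m - grand_mean) ** 2 for m in ratee_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, ratee_means) for x in row
    ) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


def average_icc(ratings_by_dimension):
    """Average the per-dimension correlations within one instrument."""
    values = [icc_one_rater(r) for r in ratings_by_dimension.values()]
    return sum(values) / len(values)


# Hypothetical ratings: three ratees, two raters each, on two dimensions.
example = {
    "Effort": [[4, 5], [2, 3], [6, 6]],
    "Self-Control": [[5, 5], [3, 4], [6, 7]],
}
index_of_agreement = average_icc(example)
```

Averaging within an instrument, as in average_icc, yields one index per
instrument per training condition, which matches the four indexes the text
refers to.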

For the peer and supervisor rater groups, Table III.61 contains the
average intraclass correlations for each of the four rating instruments
within training condition. To determine whether there were significant
differences between the treatments, chi-square tests were performed.
However, because the degree to which these measures are actually correlated
was unknown, testing for differences between the correlations proceeded as
follows. First, a minimum chi-square, assuming perfect dependency among the
measures, was computed. A significant minimum chi-square indicates a
difference between the two intraclass correlations. If the minimum chi-square
was nonsignificant, a maximum chi-square, assuming perfect independence among
the measures, was then computed. A nonsignificant maximum chi-square
indicates no difference between the two correlations. The final possibility was
that we would obtain a nonsignificant minimum chi-square but a significant
maximum chi-square. Such a result would have indicated the possibility of a
difference between the two correlations, but no definitive conclusions could
be drawn.
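
The three-way decision rule described above can be expressed compactly. The
sketch below covers only the decision logic; it does not show how the minimum
and maximum chi-square values themselves were computed, which the report does
not detail. The critical value (3.841, for 1 degree of freedom at the .05
level) and the function name are assumptions.

```python
# Assumed critical value: chi-square with df = 1 at alpha = .05.
CRITICAL_CHI2 = 3.841


def compare_correlations(min_chi2, max_chi2, critical=CRITICAL_CHI2):
    """Classify a pair of intraclass correlations from the bounding chi-squares.

    min_chi2 assumes perfect dependency among the measures;
    max_chi2 assumes perfect independence among the measures.
    """
    if min_chi2 > max_chi2:
        raise ValueError("minimum chi-square cannot exceed maximum chi-square")
    if min_chi2 >= critical:
        # Significant even under the most conservative assumption.
        return "difference"
    if max_chi2 < critical:
        # Nonsignificant even under the most liberal assumption.
        return "no difference"
    # Nonsignificant minimum but significant maximum.
    return "inconclusive"
```

For instance, compare_correlations(1.0, 6.0) falls in the third category, the
case flagged in footnote b of Table III.61.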


Table III.61

Interrater Reliabilities by Training Condition^a Across All MOS


                                 Army-Wide
                 Army-Wide      Common Tasks
                  Scales          Scales         MOS Scales      MOS Tasks

Rater Group      EO    E+P      EO    E+P       EO    E+P      EO    E+P

Peers           .26    .31     .18    .17      .16    .18     .11    .11
Supervisors     .32    .37     .21    .26      .30    .38^b   .21    .34^b

^a EO = error training only; E+P = error training plus practice. These are
one-rater reliabilities calculated on the unadjusted ratings.

^b Minimum X2 was nonsignificant, but maximum X2 was significant.


As shown in Table III.61, for the peers there were no differences
between the two training treatments on any of the rating scale types. For
the supervisors, interrater agreement was consistent across the training
treatments for the Army-wide scales, but practice may have increased rater
agreement on the MOS-specific scales.

Given that the supervisors' practice was restricted to only the Army-wide
rating dimensions, the finding that practice seemed to facilitate agreement
on the MOS scales but not the Army-wide scales seemed counterintuitive.
Hence, the data were inspected further to evaluate the consistency
of this effect across MOS. These analyses revealed a significant
difference between the two training treatments on the MOS scales only for MOS
91A; there were no differences in interrater agreement as a result of
training for any of the other MOS.


Training  Effects  on  Rater  Errors 

For both the peer and supervisor raters, there were no differences
between the two training treatments in terms of halo, leniency/severity, or
restriction-of-range errors on either the Army-wide or the MOS-specific
rating scales.


Training  Effects  on  Accuracy 

A 2 X 2 (Training X Rater Group) fixed-factor analysis of variance was
conducted to evaluate training effects on accuracy. (Accuracy was
operationalized as the average squared difference between the true scores and each
rater's observed ratings, with lower values indicating greater accuracy.)
The ANOVA results revealed no significant differences as a function of
training or rater group.
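
The accuracy index defined above can be computed as in the following minimal
sketch. The function name and the sample ratings are hypothetical; the
definition (average squared difference between true scores and a rater's
observed ratings) is taken from the text.

```python
def rating_accuracy(true_scores, observed):
    """Average squared difference between true and observed ratings.

    Both arguments hold one list of dimension scores per vignette ratee.
    Lower values indicate greater accuracy; 0.0 is a perfect match.
    """
    squared = [
        (t - o) ** 2
        for true_row, observed_row in zip(true_scores, observed)
        for t, o in zip(true_row, observed_row)
    ]
    return sum(squared) / len(squared)


# Hypothetical example: two vignette ratees rated on two dimensions.
targets = [[5, 2], [3, 6]]
one_rater = [[4, 2], [3, 4]]
score = rating_accuracy(targets, one_rater)  # (1 + 0 + 0 + 4) / 4 = 1.25
```

One such score per rater would then serve as the dependent variable in the
2 X 2 analysis of variance.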


Summary  and  Conclusions 

In this experiment to assess whether a practice component of training
improved performance rating quality beyond what was obtained by error
training alone, results were identical for the peer and supervisor raters. The
practice component yielded no significant improvement in ratings in terms of
interrater agreement or any of the rating errors assessed here (i.e., halo,
leniency/severity, and restriction of range). Further, practice did not
facilitate accuracy on a vignette rating task.

It was therefore concluded that the rater orientation and training
program to be used for Concurrent Validation would not include the additional
peer and supervisor practice components that had been tried out in the
experimental study.




Section  14 

FIELD  TEST  RESULTS:  COMBAT  PERFORMANCE  PREDICTION  SCALE* 


Forms A and B of the Combat Performance Prediction Scale were
administered at only one post during the Batch B field testing. The scale was
administered to peer and supervisor raters during the rating sessions, along
with the Army-wide and MOS-specific rating scales. Thus, the rater training
described in Section 13 preceded administration of the combat prediction
ratings as well.

Statistics  From  Field  Test 

Table III.62 presents the means and standard deviations of the combat
effectiveness ratings by rating source, scale dimension, and combat vs.
noncombat MOS. As can be seen, no meaningful differences were found between
supervisor and peer raters, or combat and noncombat MOS, or among the six
scale dimensions. All of the means are slightly above the scale midpoint of
7.5.

Table III.63 presents the one-rater intraclass correlations for the
total of the 76 items and for each of the category scores. A reliability of
.21 was obtained for the total when ratings were pooled across raters and
MOS. This reliability is based on all 76 items, some of which may have poor
psychometric properties that could attenuate the reliability coefficient. In
sum, however, the interrater agreement is disappointing, suggesting strongly
that more item analysis is warranted.

Coefficient alphas for the total 76-item scale as well as for each
category are presented in Table III.64. The value ranged from .76 to .88 for the
dimensions and was .94 for the total.
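
For readers unfamiliar with the index, coefficient alpha can be computed from
item-level ratings as in the sketch below. The function name and sample data
are hypothetical, and the sketch uses the standard formula (with sample
variances); it is not drawn from the project's own analysis programs.

```python
def coefficient_alpha(item_scores):
    """Coefficient alpha for a set of items.

    item_scores: one list per rated soldier, each holding k item ratings.
    """
    k = len(item_scores[0])

    def sample_variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

    # Variance of each item across soldiers, and variance of the total score.
    item_variances = [
        sample_variance([row[i] for row in item_scores]) for i in range(k)
    ]
    total_variance = sample_variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_variances) / total_variance)


# Hypothetical ratings: four soldiers on a three-item category.
ratings = [[9, 10, 8], [12, 11, 12], [5, 6, 7], [8, 8, 9]]
alpha = coefficient_alpha(ratings)
```

Because alpha rises with the number of items, a higher value for the 76-item
total than for any single category is to be expected.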


Revision  of  Scale  for  Concurrent  Validation 

The item statistics used in selecting 40 items for the final scale to be
used in Concurrent Validation are presented in Table III.65. Items were
selected on the basis of content domain (dimension) coverage and psychometric
properties. The 40-item dimension coverage was approximately proportional
to the 76-item dimension coverage for the field test. Psychometric
properties considered included rescaled t-value, reliability, item-dimension
correlation, item-total correlation, and across MOS and rater group means and
standard deviations.


Responses to the questions concerning rating confidence and item
applicability were also considered. Total scale confidence ratings were
slightly above midpoint on a 7-point scale (mean value of 4.25 and 4.20 for


^Development of this scale was described in Section 6, Part III. Section 14
is based primarily on an unpublished manuscript, "Development of Combat
Performance Prediction Scale," by Barry J. Riegelhaupt and Robert Sadacca.



Table III.62

Means and Standard Deviations for Rater-Ratee Pairs on the
Combat Performance Prediction Scale^a


                                        Combat                   Noncombat
                                      (11B, 19E)              (31C, 63B, 91A)

                                   Peers     Supervisors     Peers     Supervisors
Dimension                 Items    (N=51)      (N=36)        (N=85)      (N=77)

Cohesion/Commitment         15   7.73 (2.19)  8.54 (2.08)  8.70 (2.43)  8.52 (2.37)
Self-Discipline/
  Responsibility            15   8.97 (2.42)  9.45 (2.40)  9.50 (2.40)  9.55 (2.34)
Mission Orientation         14   9.22 (2.12)  9.87 (1.96)  9.61 (2.15)  9.51 (2.39)
Technical/Tactical
  Knowledge                 12   9.12 (2.04)  8.08 (2.16)  9.41 (2.21)  8.78 (2.16)
Initiative                   9   8.36 (2.54)  8.37 (2.43)  8.77 (2.70)  8.14 (2.56)
Other                       11   9.16 (2.20)  8.74 (2.42)  9.39 (2.44)  8.94 (2.39)

Total                       76   8.76 (1.87)  8.96 (1.08)  9.23 (2.04)  8.91 (2.02)

^a Scale ranged from 1 = Very Unlikely to 15 = Very Likely. Standard deviations
are shown in parentheses.


Table III.63


Intraclass Correlations for Estimating Reliabilities for the
Combat Performance Prediction Scale


                          Pooled Across MOS^a     Pooled Across Raters^b     Pooled Across
Dimension                 Peers   Supervisors     Peers   Supervisors        Raters & MOS

Cohesion/Commitment        .25       .26           .08       .23              .21
Self-Discipline/
  Responsibility           .22       .28           .20       .17              .19
Mission Orientation        .15       .12           .03       .16              [illegible]
Technical/Tactical
  Knowledge                .19       .19           .11       .17              .15
Initiative                 .28       .05           .14       .19              [illegible]
Other                      .19       .13           .11       .19              .16

Total                      .27       .20           .15       .23              .21


Table III.64

Coefficient Alpha for the Combat Performance Prediction Scale


Dimension                           Alpha

Cohesion/Commitment                  .78
Self-Discipline/Responsibility       .81
Mission Orientation                  .79
Technical/Tactical Knowledge         .76
Initiative                           .88
Other                                .77

Total                                .94

peer and supervisor ratings, respectively). In response to the question
about the number of applicable items, on the average peer and supervisor
raters felt that about 43 items applied to the soldiers they rated. The
frequency with which particular items were judged nonapplicable is presented
in Table III.65. In selecting items for Concurrent Validation, preference
was given to items that raters judged to be applicable.

The corrected intraclass correlations for the final 40-item scale are
shown in Table III.66. Vast improvement (i.e., .21 to .56) resulted when the
40 best items from among the 76 were selected. Total scale coefficient alpha
remained at .94.


Table III.66

Corrected Intraclass Correlations for Estimating Reliabilities of
Best 40 Items on Combat Performance Prediction Scale


                                                     Across Forms
Rater Group      Form A    Form B    Across Forms    and Raters

Peers             .55       .66         .56              --
Supervisors       .78       .63         .68              --
                                                         .56


This 40-item scale was judged to have sufficiently good psychometric
properties to justify its use for all MOS in the Concurrent Validation data
collection. Sample pages from the instrument are shown as Figure III.15.
The scale (e.g., development of factors) will be further refined on the basis
of the Concurrent Validation data.


1.  This soldier volunteered to lead a team to an accident scene where
immediate first aid was required, before an order was given.

2.  Near the end of a movement when soldiers were ordered to prepare fighting
positions, this soldier prepared his position quickly and then assisted
other squad members.

3.  This soldier prepared defensive positions without being told to do so.

4.  The battery/company commander instructed everyone to be packed and ready
for movement at 0800 hours. This soldier arrived late and missed the
movement.

[Response grid under each item: five rows of 15 response circles, one row
per soldier rated, anchored Very Unlikely, Fairly Unlikely, About 50-50
Chance, Fairly Likely, Very Likely. Marginal instruction: "Line up the names
of the soldiers you are rating with the rows to the right."]


Figure III.15. Sample Items from Combat Performance
Prediction Rating Scale (Page 1 of 2)


5.  When casualties were to be evacuated from a location identified only by
map coordinates, this soldier was able to locate the site by accurate
navigation in terrain with few prominent features.

6.  This soldier talked with other soldiers who were having difficulties
coping with the combat conditions.

7.  Although the unit was in the field for an extended period of time, this
soldier constantly cleaned his weapon and carried additional cleaning
supplies to be certain they were always available.

8.  During movement of a convoy from support area to forward base camps, this
soldier failed to wear his flak jacket and helmet, complaining that they
were too heavy and hot.

[Response grid under each item: five rows of 15 response circles, one row
per soldier rated, anchored Very Unlikely, Fairly Unlikely, About 50-50
Chance, Fairly Likely, Very Likely. Marginal instruction: "Line up the names
of the soldiers you are rating with the rows to the right."]


Figure III.15. Sample Items from Combat Performance
Prediction Rating Scale (Page 2 of 2)




Section  15 


FIELD  TEST  RESULTS:  ARCHIVAL/ADMINISTRATIVE  INDICATORS^ 


The Personnel File Information Form (a self-report of 201 File
information), which had been developed for use in Batch A field testing, was
administered to every soldier at every data collection location. For each
soldier tested, project staff also requested his or her 201 File. Using the
same form that soldiers completed during the testing sessions, project staff
extracted administrative measures information from each soldier's personnel
record, thus making possible a comparison of the two approaches to collecting
personal information. A revised self-report form was tried out during the
Batch B field test.


Results  From  Batch  A 


Only soldiers for whom both self-report and 201 File information were
available  were  retained  for  these  analyses.  For  a  small  number  of  cases, 
self-report  data  were  missing.  More  often,  file  data  were  missing  because  a 
soldier  did  not  grant  us  permission  to  view  his/her  201  File,  or  for  a 
variety  of  other  reasons.  Thus,  although  data  were  collected  on  548  soldiers 
during  Batch  A  field  tests,  only  505  cases  were  available  for  administrative 
measures  analyses. 


Self-Report  vs.  File  Data 

Tables III.67-III.72 show comparisons of information obtained from
self-reports and 201 File extraction. Sample sizes below 505 reflect missing
data from one or both sources.

For the Number of Awards variable, as can be seen in Table III.69, there
was perfect correspondence between the two sources. For the other measures,
which showed varying levels of agreement (i.e., off-diagonal cases), a
greater percentage of cases consistently fell below the diagonal. That is,
soldiers were reporting more occurrences of administrative measures being
received than were found in their 201 Files.
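The below-the-diagonal pattern can be checked directly from any of the
self-report by 201 File tables. The following is a minimal sketch, not part
of the original analysis, that uses the counts reported in Table III.71
(Articles 15/FLAG, Batch A); the function name diagonal_summary is an
illustrative assumption.

```python
# Counts from Table III.71 (Articles 15/FLAG, Batch A).
# Rows: number self-reported (0-7); columns: number found in the 201 File (0-3).
table_iii_71 = [
    [320, 10, 2, 0],
    [73, 6, 4, 0],
    [38, 13, 2, 1],
    [18, 8, 1, 0],
    [2, 1, 1, 1],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 0, 0, 0],
]


def diagonal_summary(table):
    """Return (n, on, below, above) case counts for a self-report x file table."""
    on = below = above = 0
    for self_report, row in enumerate(table):
        for file_count, n_cases in enumerate(row):
            if self_report == file_count:
                on += n_cases      # both sources agree
            elif self_report > file_count:
                below += n_cases   # soldier reported more than the file shows
            else:
                above += n_cases   # file shows more than the soldier reported
    return on + below + above, on, below, above


n, on, below, above = diagonal_summary(table_iii_71)
print(f"N = {n}")
print(f"agree: {on / n:.1%}  below diagonal: {below / n:.1%}  "
      f"above diagonal: {above / n:.1%}")
```

For this table, roughly two-thirds of the cases fall on the diagonal, and the
off-diagonal cases lie overwhelmingly below it.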

This situation was not surprising in light of the knowledge acquired in
our earlier exploration of 201 Files. According to regulations, not all
letters, certificates, Articles 15, and so forth, are placed in 201 Files,
and some documents are removed after a certain period of time. Also, while
201 Files are the most timely official source of information, they are
certainly not updated daily. Thus, discrepancies in the reported direction
were not unexpected.


¹Development of these indicators was described in Section 7, Part III.
Section 15 is based primarily on an unpublished manuscript, "Army-Wide
Administrative Measures," by Barry J. Riegelhaupt.


Table III.67

Comparison of Reenlistment Eligibility Information Obtained From
Self-Report and 201 Files: Batch A

                          201 File
Self-Report      Eligible   Ineligible    Total

Eligible            293         45         338
Ineligible           35         21          56

Total               328         66         394

Table III.68

Comparison of Promotion Rate^a Information Obtained From
Self-Report and 201 Files: Batch A

                                      201 File
Self-Report     0    .5   1.0   1.5   2.0   2.5   3.0   3.5   4.0   4.5   Total

0               3     0     1     2     3     1     0     0     0     0      10
.5              0     9     0     4     5     1     0     0     0     0      19
1.0             0     3    34     6    10     0     1     0     0     0      54
1.5             1     2     7    63    26     2     2     1     1     0     105
2.0             0     0     0     7   133    14     4     2     1     0     161
2.5             0     0     1     2    15    48     2     0     0     1      69
3.0             0     0     1     1     2     4    20     0     0     0      28
3.5             0     0     0     0     0     0     3     1     0     0       4
4.0             0     0     0     1     0     1     0     0     0     0       2
7.5             0     0     0     0     1     0     0     0     0     0       1
8.0             0     0     0     1     1     0     0     0     0     0       2

Total           4    14    44    87   196    71    32     4     2     1     455

^a Grades advanced/year.




                              201 File
Self-Report     0     1     2     3     4     5     6   Total

              178     9     2     1     0                 190
               80    20     3     1     0     0     0     104
               60    21     6     0     1     0     0      88
               38    11     6     3     0     0     0      58
               24     8     5     4     1     0     1      43
                7     4     1     0     1     0     0      13
                5     1     0     1     1     0     0       8

Table III.71

Comparison of Articles 15/FLAG Information Obtained From
Self-Report and 201 Files: Batch A

                       201 File
Self-Report     0     1     2     3   Total

0             320    10     2     0     332
1              73     6     4     0      83
2              38    13     2     1      54
3              18     8     1     0      27
4               2     1     1     1       5
5               1     1     0     0       2
6               1     0     0     0       1
7               1     0     0     0       1

Total         454    39    10     2     505

Table III.72

Comparison of Military Training Information Obtained From
Self-Report and 201 Files: Batch A

                   201 File
Self-Report      No    Yes   Total

No              281     15     296
Yes             188     21     209

Total           469     36     505

The intent of this comparison was to be able to address the accuracy of
self-report. However, the information contained in 201 Files does not
necessarily represent "truth," so determining the accuracy of self-report
became an even greater challenge. If soldiers had reported more positive
documents, such as letters and certificates, and fewer negative documents,
such as Articles 15, when compared with the file data, then the self-report
data would surely be suspect. However, soldiers reported receiving more
negative as well as more positive documents. An exception was Reenlistment
Eligibility, which unlike the other administrative variables is constantly
subject to change. For example, if a soldier is not eligible because of
being overweight, but subsequently loses weight, he/she once again becomes
eligible. In light of the time lag associated with updates to Reenlistment
Eligibility status in 201 Files, a number of off-diagonal cases would be
expected, and it would be impossible to predict whether these deviations
would lie above or below the diagonal. In view of the above results, it
seems likely that soldiers were honestly responding to the questions.


Correlations  With  Rating  Variables 

Tables III.73 and III.74 present the correlations between the six
administrative measures and Army-wide supervisor and peer ratings,
respectively. Correlations are shown for both the self-report and the 201
File data. As can be seen, relationships between Army-wide ratings and the
administrative measures obtained from the self-report approach were generally
higher than those obtained from 201 Files.


Conclusions 

While the self-report method in the Batch A field test yielded enhanced
variance and stronger relationships with other measures, and was easier and
less expensive to use, the selection of self-report over file extraction was
still premature. The Batch B field test was used to provide additional
information.

A number of revisions were made in the self-report at this point. The
Military Training Courses variable was dropped from consideration because it
had little variance and showed very low relationships with other measures.
Further, recall that in the earlier 201 File-EMF comparison, almost perfect
agreement between the two sources was found for the Promotion Rate and
Reenlistment Eligibility variables. Since monthly updates of the EMF
subsequently became available, there no longer was a need to collect this
information from the field. Therefore, the Reenlistment Eligibility question
and three questions used to compute Promotion Rate were dropped from the
Personnel File Information Form.


Procedural  Changes  for  Batch  B 

The goal in Batch B field testing, to improve the correspondence between
information extracted from 201 Files and that obtained from soldiers'
self-reports, was to be accomplished by shortening the form, and by having
session administrators "walk" the soldiers through the questions, explaining
which things should or should not be counted in responding to certain items.
For example, upon completing Head Start a soldier receives a certificate;
this is not a certificate of appreciation, commendation, or achievement and
thus should not be counted when responding to "How many certificates have you
received?" These oral instructions were expected to reduce some of the
discrepancies found in Batch A results.

Table III.74

Correlations Between Army-Wide Peer Ratings and
Administrative Measures: Batch A

To further investigate why self-report differed from file information,
staff personnel conducted an outlier analysis by talking with individual
soldiers, trying to determine the extent to which they were counting the
items that we intended to be counted. To the extent that the soldier was
interpreting the question as we intended, we then asked for possible
explanations as to why a self-reported item was not found in the 201 File.


Results  From  Batch  B 

Tables III.75-III.78 present Batch B comparisons of information obtained
from self-report and file extraction. As before, a greater percentage of
non-matches were found below rather than above the diagonal. That is, once
again soldiers were reporting that they had received more letters, Articles
15, and so forth, than were found in their 201 Files.

This time information as to the possible causes of the discrepancies was
available from the interviews that had been conducted as part of the outlier
comparison. The most frequently expressed explanations are presented in
Figure III.16. Some of the reasons confirmed earlier suspicions, such as
"Counted training certificates," "Counted certificate/letter that accompanied
award," and "Recently received, paperwork not completed." Other reasons were
unexpected, such as "Counted Levy alert" as a FLAG action; a Levy alert is
a notification of an impending transfer.

These interviews provided much-needed information. The lesson learned
was a simple one: For the Concurrent Validation data collection the
self-report questions needed to be more detailed, and even more clearly
specified.


Revisions  for  Concurrent  Validation 

After the two field tests of the Personnel File Information Form, three
conclusions were drawn. First, self-report yields the most timely data.
Second, self-report yields more complete data. Finally, as mentioned above,
the questions needed to be more detailed.

Acting on this knowledge, we developed a short, simple, but more
detailed records report form. The resulting Personnel File Information Form
(Form 7) is shown in Figure III.17.

Form  7  is  being  used  as  a  self-report  instrument  during  Concurrent 
Validation  data  collection,  and  the  information  obtained  will  be  combined 
with  the  Promotion  Rate  and  Reenlistment  Eligibility  variables  obtained  from 
the  EMF. 


Table III.77

Comparison of Articles 15/FLAG Information Obtained From
Self-Report and 201 Files: Batch B

                             201 File
Self-Report     0     1     2     3     4     5   Total

0              93     1     0     0     0     0      94
1              13     1     0     0     0     0      14
2               4     2     1     0     0     0       7
3               3     0     0     0     0     0       3
4               0     0     0     0     0     0       0
5               0     0     0     0     0     1       1

Total         113     4     1     0     0     1     119

Table III.78

Comparison of M16 Qualification Information Obtained From
Self-Report and 201 Files: Batch B

                          201 File
Self-Report    Missing    MKM    SPS    EXP   Total

Missing           14        0      0      0      14
MKM                3        8      2      0      13
SPS                6       23     14      3      46
EXP                4       14     15     13      46

Total             27       45     31     16     119
Explanations Given For Discrepancies


•  Self-report is correct

•  Recently reviewed, paperwork not complete

•  Received while at previous assignment

•  Counted training certificates

•  Counted certificate/letter that accompanied award

•  Counted promotions

•  Company level

•  Didn't understand difference (e.g., between
   Articles 15 and FLAG actions)

•  Has been removed

•  Counted Levy alert

•  Not worth any points

•  Never forwarded to file

•  Outdated information

•  Knows of discrepancy and trying to correct it

•  Might be in restricted file

•  What do you mean it's not there?


Figure III.16. Results of outlier comparison from Self-Report information
DATE

□□□□□□
DAY  MONTH  YEAR


MARKING INSTRUCTIONS

Use only a No. 2 black lead pencil.

Read each question carefully. Make a HEAVY BLACK
MARK in the circle that corresponds to your answer.
Be sure to FILL THE CIRCLE.

Please do not make any stray marks.


1. Mark the circle(s) corresponding to the awards and decorations listed below
   that you have received. If you have received any not listed below, use the
   space(s) to the right of Other to write in the name(s) of the award(s) or
   decoration(s) and then mark the circle(s).

   O Air Assault Badge
   O Aircraft Crewman Badge
   O Army Achievement Medal
   O Army Commendation Medal (Valor or Merit)
   O Combat Field Medical Badge
   O Combat Infantry Badge
   O Diver's Badge
   O Driver and Mechanic Badge
   O Expert Field Medical Badge
   O Expert Infantry Badge
   O Explosive Ordnance Disposal Badge
   O Good Conduct Medal
   O Nuclear Reactor Operator Badge
   O Parachutist Badge
   O Pathfinder Badge
   O Purple Heart


For the next two questions, mark the circle corresponding to the number of Letters
and Certificates of Appreciation, Commendation, Achievement that you have received.
DO NOT count Letters or Certificates received for:

   •  Completion of AIT.
   •  Completion of any training courses taken after AIT.
   •  Completion of Head Start.
   •  Announcement of a promotion.
   •  Announcement of an award or decoration.


2. How many Letters of Appreciation, Commendation, Achievement have you received?

   O 0     O 3
   O 1     O 4 or more
   O 2


3. How many Certificates of Appreciation, Commendation, Achievement have you received?

   O 0     O 3
   O 1     O 4 or more
   O 2


Figure III.17. Self-Report Form for use in Concurrent Validation. (Page 1 of 2)


4. What was your last Physical Readiness Test Score? (Scores range from 0-300)


5. What was your last M16 Qualification?

   O Marksman (MKM)     O Sharpshooter (SPS)     O Expert (EXP)


6. If you have taken a Skill Qualification Test (SQT), what was your most recent
   score (SQT scores range from 0-100).

   O Mark here if you have never taken an SQT.


7. How many Articles 15 have you received?

   O 0     O 3
   O 1     O 4 or more
   O 2


8. How many FLAG Actions have you received? DO NOT count a LEVY ALERT as a FLAG Action.

   O 0     O 3
   O 1     O 4 or more
   O 2


Figure III.17. Self-Report Form for use in the Concurrent Validation. (Page 2 of 2)
Section 16


FIELD TEST RESULTS: CRITERION INTERRELATIONSHIPS¹


Up to this point we have considered the criterion field test results in
terms of the item/scale characteristics and reliabilities of each measure.
This view is consistent with the principal objective of the field tests which
was to provide the information necessary for revising each measure as
appropriate for the Concurrent Validation. The covariances among the
measures were not a primary concern at this stage, since the field test
samples were not large and questions of latent structure and criterion
combination could be handled better with the larger samples from the
Concurrent Validation.

However, knowledge of the intercorrelations is useful for uncovering
potentially aberrant characteristics of the measures and for formulating
analytic questions to ask of the concurrent sample data. Consequently, a
selected set of intercorrelation matrixes is presented below.


Representative Criterion Intercorrelations


Some might accuse Project A of collecting a bit too much data. Such an
accusation becomes credible when the intent is to calculate an
intercorrelation matrix among the principal criterion measures. The list is
long or short depending on how much aggregation one is willing to tolerate.
If the supervisor, peer, and self ratings for all rating scales are counted,
along with hands-on and knowledge test scores for each of the 30 tasks, the
total number of criterion variables adds up to about 1,600. That is a few too
many to interpret at a glance, without further reduction.

One strategy that could be used is cluster or factor analysis. However,
rather than use empirical methods with such relatively small samples, we
will delay these analyses until the concurrent data are available.

Instead, we reduced the number of variables to a much smaller number, by
limiting the list to ratings obtained only from supervisors and peers, and by
averaging across the 11 Army-wide scales, the 14 common task scales, the MOS
task scales, and the MOS-specific BARS. For the job knowledge tests, scores
were totaled for the 15 tasks measured hands-on and separately for the 15
tasks not measured hands-on.

After all this was done, the variable list was reduced to 16. The
resulting matrixes are shown in Table III.79 for the nine MOS.
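The reduction step described above (average the scales within an instrument,
then correlate the resulting composites) can be sketched as follows. This is
an illustration only: the data are hypothetical, only three scales are shown
rather than the 11 Army-wide scales, and the function names are assumptions
rather than anything used by the project.

```python
from math import sqrt


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def composite(scale_scores):
    """Average one soldier's ratings across the scales of one instrument."""
    return mean(scale_scores)


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical data: four soldiers, three supervisor rating scales each,
# plus a hands-on total score for the same four soldiers.
supervisor_scales = [[5, 6, 4], [3, 4, 4], [6, 6, 7], [2, 3, 3]]
hands_on_total = [62.0, 55.0, 71.0, 48.0]

# One composite per soldier, then one correlation per pair of composites.
aw_supv = [composite(scores) for scores in supervisor_scales]
r = pearson(aw_supv, hands_on_total)
print(f"r = {r:.2f}")
```

Repeating the last step for every pair of the 16 reduced variables would give
one lower-triangular matrix per MOS of the kind shown in Table III.79.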


¹This section is a revision and expansion of materials in the paper,
Criterion Reduction and Combination Via a Participative Decision-Making
Panel, by John P. Campbell and James H. Harris, in the ARI Research Note in
preparation which supplements this Annual Report.


Table III.79

Intercorrelation Matrixes for 16 Criterion Measures Obtained
During Criterion Field Tests, by MOS


A. Cannon Crewman (MOS 13B)

MEASURE*                          1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16

Hands-On Test
 1  Total Score
Task Performance Rating
 2  Avg. HO Task Rating: Supv.   34
 3  Avg. HO Task Rating: Peer    47  46
Knowledge Test
 4  All HO Tasks                 41  24  18
 5  All Non-HO Tasks             21  24  06  76
Training Test
 6  Total Score                  20  13  16  59  57
AW BARS
 7  Avg. Rating: Supv.           25  59  34  28  26  21
 8  Avg. Rating: Peer            29  48  54  30  24  25  64
AW Rating
 9  Overall Perf.: Supv.         19  53  26  28  27  24  77  48
10  Overall Perf.: Peer          24  41  58  29  21  26  49  73  63
11  NCO Potential: Supv.         34  47  31  27  20  13  65  44  49  34
12  NCO Potential: Peer          28  38  50  24  16  16  54  71  43  68  46
MOS BARS
13  Avg. Rating: Supv.           38  63  39  35  30  26  72  48  57  45  65  45
14  Avg. Rating: Peer            35  53  68  27  17  26  52  74  36  65  44  61  62
AW Common Task Rating
15  Avg. Rating: Supv.           23  51  41  26  28  20  65  44  60  38  46  37  72  53
16  Avg. Rating: Peer            17  36  67  26  02  30  03  62  27  59  21  44  29  65  36

B. Motor Transport Operator (MOS 64C)

MEASURE*                          1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16

Hands-On Test
 1  Total Score
Task Performance Rating
 2  Avg. HO Task Rating: Supv.   32
 3  Avg. HO Task Rating: Peer    22  70
Knowledge Test
 4  All HO Tasks                 58  23  10
 5  All Non-HO Tasks             35  33  14  65
Training Test
 6  Total Score                  31  23  01  58  74
AW BARS
 7  Avg. Rating: Supv.           22  67  55  17  21  21
 8  Avg. Rating: Peer            19  60  61  17  22  16  78
AW Rating
 9  Overall Perf.: Supv.         19  65  47  13  13  15  23  68
10  Overall Perf.: Peer          11  59  63  20  18  14  72  79  57
11  NCO Potential: Supv.         22  56  44  14  20  28  78  60  78  50
12  NCO Potential: Peer          06  39  50  05  10  05  67  77  57  73
MOS BARS
13  Avg. Rating: Supv.           22  25  13  23  24  13  25  21  20  22  23  14
14  Avg. Rating: Peer            08  56  68  20  28  20  56  72  47  67  42  53  25
AW Common Task Rating
15  Avg. Rating: Supv.           27  69  48  25  19  20  67  59  58  52  56  50  24  48
16  Avg. Rating: Peer            20  40  51  12  13  06  42  55  39  58  37  55  23  51  57

*Code:  AW    Army-Wide
        BARS  Behaviorally Anchored Rating Scale
        HO    Hands-On

C. Administrative Specialist (MOS 71L)

MEASURE*                          1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16

Hands-On Test
 1  Total Score
Task Performance Rating
 2  Avg. HO Task Rating: Supv.   23
 3  Avg. HO Task Rating: Peer    16  77
Knowledge Test
 4  All HO Tasks                 52  12  02
 5  All Non-HO Tasks             43  08  05  68
Training Test
 6  Total Score                  54  22  23  63  51
AW BARS
 7  Avg. Rating: Supv.           24  54  36  19  23  24
 8  Avg. Rating: Peer            10  50  40  04  06  14  80
AW Rating
 9  Overall Perf.: Supv.         33  49  23  19  20  25  69  64
10  Overall Perf.: Peer          17  35  48  13  22  18  67  78  60
11  NCO Potential: Supv.         24  45  20  07  17  22  64  52  62  37
12  NCO Potential: Peer          13  35  41  06  15  20  68  76  41  28  40
MOS BARS
13  Avg. Rating: Supv.           23  60  39  15  10  24  68  48  60  26  54  17
14  Avg. Rating: Peer            23  55  57  09  14  24  62  60  56  37  51  38  78
AW Common Task Rating
15  Avg. Rating: Supv.           08  41  25  02  03  06  49  26  46  23  33  13  60
16  Avg. Rating: Peer            30  20      07  21  24  29  46  37  46  24  49  18  38


D. Military Police (MOS 95B)

MEASURE*                          1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16

Hands-On Test
 1  Total Score
Task Performance Rating
 2  Avg. HO Task Rating: Supv.   27
 3  Avg. HO Task Rating: Peer    31  65
Knowledge Test
 4  All HO Tasks                 11  17  14
 5  All Non-HO Tasks             21  15  08  50
Training Test
 6  Total Score                  11  05  10  43  56
AW BARS
 7  Avg. Rating: Supv.           26  68  53  23  25  22
 8  Avg. Rating: Peer            35  56  70  09  17  21  76
AW Rating
 9  Overall Perf.: Supv.         33  65  45  17  24  17  83  66
10  Overall Perf.: Peer          28  57  70  13  15  15  70  84  64
11  NCO Potential: Supv.         28  64  49  18  19  21  73  59  77  56
12  NCO Potential: Peer          28  57  67  13  18  27  67  84  57  82  54
MOS BARS
13  Avg. Rating: Supv.           23  79  58  16  18  12  75  65  75  59  70  57
14  Avg. Rating: Peer            31  61  79      07  19  60  85  56  77  53  75  72
AW Common Task Rating
15  Avg. Rating: Supv.           40  69  59  18  18  20  59  50  62  53  62  47  65
16  Avg. Rating: Peer            40  52  78  03  03  07  45  63  43  67  37  63  46
E. Infantryman (MOS 11B)

MEASURE*                          1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16

Hands-On Test
 1  Total Score
Task Performance Rating
 2  Avg. HO Task Rating: Supv.   46
 3  Avg. HO Task Rating: Peer    36  61
Knowledge Test
 4  All HO Tasks                 55  39  30
 5  All Non-HO Tasks             41  29  28  78
Training Test
 6  Total Score                  40  34  31  70  71
AW BARS
 7  Avg. Rating: Supv.           35  67  44  28      23
 8  Avg. Rating: Peer            37  56  58  23  26  26  79
AW Rating
 9  Overall Perf.: Supv.         29  61  40  24  21  12  81  53
10  Overall Perf.: Peer          34  52  51  22  25  25  65  85  56
11  NCO Potential: Supv.         29  55  34  25  20  15  83  60  76  56
12  NCO Potential: Peer          41  43  47  29  31  27  62  85  47  76
MOS BARS
13  Avg. Rating: Supv.           41  76  53  31  25  21  84  69  77  62  73  55
14  Avg. Rating: Peer            40  55  69  22  24  24  60  04  55  79  50  71  70
AW Common Task Rating
15  Avg. Rating: Supv.           39  74  48  31  22  17  72  61  67  54  62  48  83  61
16  Avg. Rating: Peer            43  56  68  32  33  30  59  77  54  71  44  65  63  82  68
F. Armor Crewman (MOS 19E)

MEASURE*                          1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16

Hands-On Test
 1  Total Score
Task Performance Rating
 2  Avg. HO Task Rating: Supv.   09
 3  Avg. HO Task Rating: Peer    10  50
Knowledge Test
 4  All HO Tasks                 39  19  16
 5  All Non-HO Tasks             32  13  07  74
Training Test
 6  Total Score                  25  22  13  58  64
AW BARS
 7  Avg. Rating: Supv.           15  52  39  05  07  15
 8  Avg. Rating: Peer            15  25  53  07  06  21  66
AW Rating
 9  Overall Perf.: Supv.         12  43  33  15  15  20  82  53
10  Overall Perf.: Peer          13  28  51  17  12  27  50  82  42
11  NCO Potential: Supv.         18  46  36  08  09  17  77  54  66  43
12  NCO Potential: Peer          25  32  46  15  10  24  57  79  44  72  50
MOS BARS
13  Avg. Rating: Supv.           14  70  50  27  22  30  69  40  60  37  60  42
14  Avg. Rating: Peer            12  30  62  24  14  21  41  65  42  64  36  58  59
AW Common Task Rating
15  Avg. Rating: Supv.           20  50  38  20  20  16  52  27  50  31  37  26  54  33
16  Avg. Rating: Peer            15  34  63  25  14  22  35  63  38  54  26  46  42  64  51

G. Radio Teletype Operator (MOS 31C)

MEASURE*                          1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16

Hands-On Test
 1  Total Score
Task Performance Rating
 2  Avg. HO Task Rating: Supv.   17
 3  Avg. HO Task Rating: Peer    18  71
Knowledge Test
 4  All HO Tasks                 37  21  20
 5  All Non-HO Tasks             35  18  22  58
Training Test
 6  Total Score                  26  21  23  50  40
AW BARS
 7  Avg. Rating: Supv.           08  59  46  15  13  13
 8  Avg. Rating: Peer            08  42  62  17  23  21  68
AW Rating
 9  Overall Perf.: Supv.         11  64  51  11  15  15  91  61
10  Overall Perf.: Peer         -01  47  61  14  17  17  58  83  54
11  NCO Potential: Supv.         19  59  44  26  21  19  83  56  80  48
12  NCO Potential: Peer          20  42  61  19  28  24  59  81  57  71  60
MOS BARS
13  Avg. Rating: Supv.           10  70  59  23  13  26  75  55  75  55  61  50
14  Avg. Rating: Peer            08  49  74  11  16  21  56  82  52  75  44  64  66
AW Common Task Rating
15  Avg. Rating: Supv.           12  48  39  07  08  14  51  26  52  30  46  34  49  33
16  Avg. Rating: Peer            03  33  54  02  11  03  32  53  37  50  31  49  25  43
H. Light-Wheel Vehicle Mechanic (MOS 63B)

MEASURE*                          1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16

Hands-On Test
 1  Total Score
Task Performance Rating
 2  Avg. HO Task Rating: Supv.   18
 3  Avg. HO Task Rating: Peer    12  59
Knowledge Test
 4  All HO Tasks                 31  08
 5  All Non-HO Tasks             32  15  28  66
Training Test
 6  Total Score                  37  22  19  52  61
AW BARS
 7  Avg. Rating: Supv.           19  71  57  15  18  16
 8  Avg. Rating: Peer            15  49  66  26  30  15  73
AW Rating
 9  Overall Perf.: Supv.         24  71  58  16  22  23  85  65
10  Overall Perf.: Peer          18  49  52  25  22  10  65  81  60
11  NCO Potential: Supv.         11  60  51  06  10  12  78  54  67  57
12  NCO Potential: Peer          16  56  54  21  23  12  66  81  63  80  56
MOS BARS
13  Avg. Rating: Supv.           20  80  61  14  16  24  90  68  81  60  72  63
14  Avg. Rating: Peer            21  64  72  24  26  25  69  70  70  65  54  65  77
AW Common Task Rating
15  Avg. Rating: Supv.           07  59  46  06  23  13  62      63  55  54  57  66  63
16  Avg. Rating: Peer           -03  36  46  18  35  26  43  57  39  49  31  46  43  48  63
I. Medical Specialist (MOS 91A)

MEASURE*                          1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16

Hands-On Test
 1  Total Score
Task Performance Rating
 2  Avg. HO Task Rating: Supv.   16
 3  Avg. HO Task Rating: Peer    19  61
Knowledge Test
 4  All HO Tasks                 21  00  03
 5  All Non-HO Tasks             21  04  05  61
Training Test
 6  Total Score                  07 -07 -08  39  31
AW BARS
 7  Avg. Rating: Supv.           12  59  44  09  14  04
 8  Avg. Rating: Peer            13  42  50  09  09  07  79
AW Rating
 9  Overall Perf.: Supv.         14  49  35  13  16  11  89  67
10  Overall Perf.: Peer          13  39  45  09  05  23  70  89  62
11  NCO Potential: Supv.         13  51  42  10  17  01  81  62  82  57
12  NCO Potential: Peer          15  38  48  09  10  17  64  81  58  75  58
MOS BARS
13  Avg. Rating: Supv.           14  61  50  11  20  13  78  67  68  63  59  57
14  Avg. Rating: Peer            12  39  64  08  05  15  61  80  50  77  43  68  68
AW Common Task Rating
15  Avg. Rating: Supv.           13  55  27  05  19 -10  62  42  54  43  50  32  60
16  Avg. Rating: Peer            20  43  48  02  14 -01  45  54  35  54  30  50  50
These matrixes illustrate some basic truths. First, the methods correlate
more highly within themselves than they do across measures. If one were to
examine a multimethod (hands-on, knowledge tests, ratings), multitrait (the
15 tasks) matrix and submit it to a factor analysis, the factors would most
likely be defined by methods rather than job tasks. This is not unlike what
happens when individual assessment center measures are factored. Factors tend
to be defined by the particular exercise or test rather than the trait
(Sackett & Dreher, 1982).

However, two points are crucial. Although the variables of task
proficiency, job knowledge, and general soldiering performance are certainly
not independent, they are also far from being identical in spite of the
influence of method variance. Also, one of the great unanswered questions in
applied psychology remains: Is what we refer to as method variance (e.g.,
halo) in ratings, paper-and-pencil knowledge tests, or job sample tests
really relevant and valid, or is it simply noise? It is not necessarily
error, but may indeed reflect individual differences in performance that are
quite relevant.


True  Score  Relationships 

The intercorrelations in the previous table are between uncorrected scores
on each variable. To get closer to the "truth" about the criterion space,
the intercorrelations were corrected for attenuation, which yielded an
estimate of the true score intercorrelation matrix. As illustrations,
matrixes for MOS 11B and MOS 71L are shown in Tables III.80 and III.81.

These correlations were computed on the assumption that the most accurate
portrayal of the structure of the criterion space is provided by the
interrelationships among the true scores. Estimating true score correlations
by correcting for attenuation is a dangerous business that must be carefully
done. The reliabilities that were used for Tables III.80 and III.81 are
conservative in that they do not include all the sources of error that might
account for unreliability. For example, variability across testing occasions
is not counted here but it might in fact serve to lower the correlations
between pairs of variables (e.g., hands-on and knowledge tests). Also, to
give more stability to the estimates, the adjusted correlations for
supervisor and peer ratings were simply averaged.
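
As a rough illustration of the correction described above, the standard
correction for attenuation divides an observed correlation by the square
root of the product of the two reliabilities. The sketch below assumes that
standard formula; the function name and the reliability values of .64 are
hypothetical and were chosen only for illustration, since the reliabilities
actually used for Tables III.80 and III.81 are not listed in this section.

```python
import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the true score correlation between x and y.

    r_xy is the observed correlation; r_xx and r_yy are the reliability
    estimates for the two measures (hypothetical inputs in this sketch).
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie in the interval (0, 1]")
    return r_xy / math.sqrt(r_xx * r_yy)

# Observed hands-on vs. job knowledge correlation of .55 (Table III.80),
# combined with ASSUMED reliabilities of .64 for each measure.
estimate = correct_for_attenuation(0.55, 0.64, 0.64)
print(round(estimate, 2))  # 0.86
```

Because the reliabilities sit in the denominator, underestimating them
inflates the corrected value, which is one reason the correction has to be
applied with care.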

Looking at the true score intercorrelations, it seems reasonable to conclude
that the hands-on measures and the knowledge tests designed to be parallel
to them share a significant proportion of their variance. The Army-wide
rating measures of general soldier performance are by no means independent
of the job sample measures, but they have less in common with job samples
than do the knowledge tests.

One large difference between Tables III.80 and III.81 is in the lower
correlations for MOS 71L between the ratings and the other variables,
particularly the ratings of specific task performance. However,
Administrative Specialists tend to work more in isolation than other MOS and
are not observed as closely. It all seems to make reasonable sense.


Table III.80

Intercorrelations Among Selected Criterion Measures for
Infantryman (MOS 11B)a

   Measure                                            1     2     3     4     5

1  Total Score on all HO Tasks                       ( )   .67   .86   .60   .57
2  Avg. of 15 HO task ratings (Supv. + Peer)b        .41   ( )   .44   .39   .73
3  Total score on Job Knowledge Test                 .55   .35   ( )   .80   .31
4  Total score on Training Knowledge Test            .40   .33   .70   ( )   .25
5  AW BARS - Overall Effectiveness (Supv. + Peer)    .32   .51   .23   .19   ( )

a Correlations corrected for attenuation are above the diagonal.
b The corresponding correlations for supervisory ratings and peer ratings
were averaged.


Table III.81

Intercorrelations Among Selected Criterion Measures for
Administrative Specialist (MOS 71L)a

   Measure                                            1     2     3     4     5

1  Total Score on all HO Tasks                       ( )   .28   .76   .73   .35
2  Avg. of 15 HO task ratings (Supv. + Peer)b        .20   ( )   .10   .29   .51
3  Total score on Job Knowledge Test                 .52   .07   ( )   .82   .22
4  Total score on Training Knowledge Test            .54   .23   .63   ( )   .27
5  AW BARS - Overall Effectiveness (Supv. + Peer)    .25   .39   .16   .22   ( )

a Correlations corrected for attenuation are above the diagonal.
b The corresponding correlations for supervisory ratings and peer
ratings were averaged.




Summary 

In general, the covariance among criteria did not reveal any fatal flaws in
the array of measures constructed to cover the domain of job performance.
While there is considerable method variance among the ratings, there is also
a positive manifold in the matrixes, which suggests that there is indeed a
latent structure to be investigated. Performance is certainly not one thing,
and the pattern of correlations is conceptually sensible. It whets the
appetite for a more systematic investigation of the latent structure of job
performance. Some speculations about that structure are summarized in the
following subsection.
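
The "positive manifold" mentioned above simply means that every correlation
among the criterion measures is positive. A minimal sketch of that check is
given here, using the uncorrected correlations below the diagonal of Table
III.80 (MOS 11B); the function names are illustrative and are not part of
any Project A software.

```python
# Uncorrected correlations below the diagonal of Table III.80 (MOS 11B),
# keyed by (row measure, column measure).
observed = {
    (2, 1): 0.41,
    (3, 1): 0.55, (3, 2): 0.35,
    (4, 1): 0.40, (4, 2): 0.33, (4, 3): 0.70,
    (5, 1): 0.32, (5, 2): 0.51, (5, 3): 0.23, (5, 4): 0.19,
}

def has_positive_manifold(correlations):
    """Return True when every off-diagonal correlation is above zero."""
    return all(r > 0 for r in correlations.values())

def mean_correlation(correlations):
    """Average off-diagonal correlation, a crude index of shared variance."""
    return sum(correlations.values()) / len(correlations)

print(has_positive_manifold(observed))
print(round(mean_correlation(observed), 3))
```

A positive manifold is compatible with one general factor or with several
correlated factors, so the check motivates further structural analysis
rather than settling the question.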


A  Plausible  Model 


Even though the major data collections and analyses are yet to come, a great
deal has been learned to date about the domain of first-tour enlisted
performance. The total domain was described via two major collections of
critical incidents, systematic examination of all available job survey data,
review of all job specification documentation, careful analysis of AIT
Programs of Instruction, and multiple reviews and elaborations by many panels
of expert judges. Subsequent to the job and task descriptions, multiple
methods of performance measurement were developed, pilot tested, revised,
field tested, revised, reviewed, and revised again.


As a consequence, we have formed some further ideas of how the latent
structure of job performance might look when cast against our operational
measures. This model is not meant to be definitive or even based on the most
relevant data (e.g., the covariances to be obtained in the Concurrent
Validation). Rather it is meant to be consistent with what we know so far
and to illustrate the kind of job performance framework toward which we are
working.


As a first attempt at portraying the latent structure, suppose we suggest
that the enlisted performance domain is made up of the following general
factors:


(1) Maintaining and upgrading current job knowledge (including common
tasks). A legitimate question here might be why the mere possession of job
knowledge should be a factor in the performance domain. However, if a major
goal of the Army is to be ready to enter a conflict on short notice, then
possessing a high degree of current knowledge is performance. Having the
proper information and being able to use it (Factor 2) are not the same
thing. However, neither are they independent. Consequently, our model must
stipulate that these first two factors are significantly correlated and the
relationship stems both from sharing common requirements (e.g., general
cognitive ability) and from Factor 1 being, in part, a "cause" of performance
differences on Factor 2.


(2) Having technical proficiency on the primary job tasks. This factor
refers to being able to perform on the technical content, be it complicated
or simple. Technical is defined broadly but not so broadly as to include
leadership or other interpersonal interaction task requirements. Within this
construct the content of the tasks may vary considerably and rely on very
different abilities (e.g., playing a musical instrument vs. repairing a truck
generator). For most jobs it might also be possible to think of two such
general factors, executing "standard" procedures and troubleshooting special
problems.

(3) Exhibiting peer leadership and support. Enlisted personnel often have
the opportunity to teach, support, or provide leadership for their peers.
This factor refers to the frequency and proficiency with which people do
that when the occasion arises. It would also be reasonable to think of this
factor as composed of the four subgeneral factors that have been found in
leadership research (e.g., Bowers & Seashore, 1966): goal setting,
facilitating goal attainment, one-on-one individual support, and facilitating
group morale.

(4) Demonstrating commitment to Army regulations and traditions.
Performance on this factor refers to maintaining living quarters and
equipment, and maintaining a high level of physical fitness and appropriate
military appearance. This factor is perhaps a bit more tenuous than the
others. Defining it this way assumes that all of the different elements will
covary to a high degree.

(5) Continuing to perform under adverse conditions. This factor would share
many components in common with the previous three and thus should not be
orthogonal. However, the act of carrying out job assignments when wet,
tired, or in danger is viewed as a very important and distinct aspect of
performance.

(6) Avoiding serious disciplinary problems. Incurring disciplinary actions
because of problems with drugs, alcohol, neglect of duty, or serious
interpersonal conflict represents a great cost to the Army. Successfully
avoiding these costs is viewed as an important factor in overall performance.

Standing in a direct causal relation to the performance factors are
knowledges and skills learned during training, abilities and other individual
characteristics present at the time of hire, and the choice to perform, which
is supposedly under motivational control. For our purposes here, the causal
latent variables of most concern are the knowledges, skills, and motivational
predispositions acquired during training. Consequently, we might posit that
there are three major training performance factors in the latent structure:

•  Hands-on task proficiency.

•  General  job  knowledge. 

•  Exhibition  of  good  soldiering  skills  and  discipline. 

A very rough schematic that portrays these latent variables and lists the
observable measures of these variables that we have available in Project A
is shown in Figure III.18. The arrows between latent variables and
operational measures indicate an expected correlation; the expected size of
the correlation is not indicated. Arrows between latent variables indicate a
hypothesized causal relation.

Figure III.18. Job Performance: A Proposed Structural Model.

Several points can be made about this picture. First, the principal data
upon which the list of latent constructs is based are the results of the
critical incident workshops conducted during the development of the
behaviorally anchored rating scales. We have not yet had the opportunity to
examine the factor structure of the hands-on measures or knowledge tests, or
even to look comprehensively at the factor structure of all the rating
scales. These analyses really must wait until larger sample sizes are
available with the revised measures.

Second, the manifest job performance variables are by no means "pure"
measures of the latent constructs. For example, Factor 2 would seem to
underlie virtually all of the observable measures. By contrast, the
"avoidance of disciplinary problems" should influence only some of the
Army-wide BARS scales, NCO potential, attrition, and perhaps expected combat
performance. However, in general, most of the observable variables are
probably multiply determined.

Third, the above reasoning suggests that if the operational criteria share
so many common determinants, they probably should not receive grossly
differential weights when combined into composites for the purpose of test
validation.

Fourth, differential prediction of job performance across jobs must come
from different requirements for success on Factors 1 and 2 (e.g., psychomotor
abilities vs. verbal ability). To a certain extent it could also result from
a greater weight being given in some MOS to peer leadership and performance
under adverse conditions.

Fifth, limiting measures of training success to paper-and-pencil tests of
knowledges mastered is probably not sufficient. To determine the
relationship of training performance to job performance more completely,
additional measures would be required.

Sixth, the causal relations among the individual differences present at the
time of entry, the latent variables making up training performance, the
latent variables that constitute job performance, and the operational
criterion measures can be described with brilliant understatement by saying
that they are complex. As part of that complexity it seems reasonable to
assert that:

•  Among the latent variables describing training performance, hands-on
   proficiency and content knowledge are most likely more highly related
   to each other than either is to soldiering skills and discipline.
   Further, content knowledge stands in at least a partial causal
   relation to hands-on proficiency.

•  Among the latent variables describing job performance, job knowledge
   would seem to come first in the causal chain since it at least
   partially determines technical proficiency. However, both these
   factors most likely cause at least some of the individual differences
   in peer leadership performance. A causal relation between technical
   proficiency and either commitment to regulations/traditions or
   avoiding disciplinary problems does not seem so likely. However, some
   may wonder whether commitment to regulations/traditions and avoidance
   of disciplinary problems are bipolar.

•  If the first two factors were measured with high construct validity,
   Factor 1 (current job knowledge) should have a direct effect only on
   job knowledge tests. Job knowledge should create differences on other
   operational measures only through its influence on technical
   proficiency. Consequently, if technical proficiency could be held
   constant, the observed correlations between job knowledge tests and
   all other variables should be reduced to zero.

•  Since peer leadership and support was given a broad definition (in
   terms of leadership theory), greater knowledge, higher technical
   skill, higher commitment, demonstrated performance under stress, and
   an exemplary record would all "cause" an individual to exhibit more
   effective peer leadership.

•  As somewhat of a contrast, performance under adverse conditions is
   conceptualized as a dispositional variable. Consequently it would be
   under motivational control and not a function of knowledge or
   ability.
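
The prediction that correlations with job knowledge tests should drop to
zero once technical proficiency is held constant can be illustrated with a
first-order partial correlation. The sketch below assumes the standard
partial correlation formula; the function name and all numeric values are
hypothetical and are not Project A results.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y with z held constant.

    In this sketch x stands for a job knowledge test, y for some other
    operational measure, and z for technical proficiency.
    """
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return (r_xy - r_xz * r_yz) / denominator

# Hypothetical case 1: knowledge affects y only through proficiency, so
# the observed r_xy equals the product r_xz * r_yz and the partial vanishes.
fully_mediated = partial_correlation(0.6 * 0.5, 0.6, 0.5)

# Hypothetical case 2: r_xy is larger than that product, so some relation
# remains after proficiency is held constant.
partly_mediated = partial_correlation(0.45, 0.6, 0.5)

print(round(fully_mediated, 3), round(partly_mediated, 3))
```

A clearly nonzero partial correlation in real data would suggest either a
direct path from job knowledge to the other measure or imperfect
measurement of technical proficiency.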

The reader should keep firmly in mind that the above comments are still
speculative. Such a model of performance will go through many iterations
before Project A is finished. However, the first major empirical
specifications will be based on the Concurrent Validation, which began in
FY85 and which will be analyzed in FY86. The concluding sections of this
report (Part IV) briefly outline the data collection procedure and the
analysis plan for this data set.


CONCURRENT  VALIDATION 


Included in Part IV are a listing of the predictor and
criterion arrays used in the Concurrent Validation, a
description of the samples, a brief summary of the procedures
used for data collection, and an outline of the analyses that
will be carried out. The data collection itself and the basic
data analyses will be completed during FY86.


Section  1 

CONCURRENT VALIDATION: PREDICTOR AND CRITERION VARIABLES¹


Parts II and III of this report have described the development and field
testing of the predictor tests and performance measures to be used in the
Concurrent Validation phase of Project A. The predictors were common across
all jobs, but not all of the performance measures were used with every MOS in
the total Concurrent Validation sample.

The nomenclature for MOS groupings also changed slightly for the Concurrent
Validation phase. Batch A and Batch B MOS are now known collectively as
Batch A; they are the nine MOS that were used in the criterion field tests.
The remaining 10 MOS are still designated as Batch Z. Batch A and Batch Z
MOS are listed in Table IV.1.


Table IV.1

Project A MOS Used in the Concurrent Validation Phase


Batch A

11B  Infantryman
13B  Cannon Crewman
19E  Armor Crewman
31C  Radio Teletype Operator
63B  Light-Wheel Vehicle Mechanic
64C  Motor Transport Operator
71L  Administrative Specialist
91A  Medical Specialist
95B  Military Police


Batch Z

12B  Combat Engineer
16S  MANPADS Crewman
27E  TOW/Dragon Repairer
51B  Carpentry/Masonry Specialist
54E  Chemical Operations Specialist
55B  Ammunition Specialist
67N  Utility Helicopter Repairer
76W  Petroleum Supply Specialist
76Y  Unit Supply Specialist
94B  Food Service Specialist


While the same predictor battery was used for each MOS, the criterion
measures used for Batch A MOS were different from those used for MOS in
Batch Z. The major distinction is that the MOS-specific job performance and
job knowledge measures developed for the Batch A MOS were not prepared for
the 10 MOS in Batch Z. For these jobs only the Army-wide measures and the
training achievement tests were available.

The complete array of predictors in the Trial Battery is listed in Table
IV.2. The lists of criteria for Batch A and Batch Z are shown in Table IV.3.

1 Part IV is based in part on a paper, Criterion Reduction and Combination
Via a Participative Decision-Making Panel, by John P. Campbell and James H.
Harris, in the ARI Research Note in preparation, which supplements this
Annual Report.


Table IV.2

Summary of Predictor Measures Used in Concurrent Validation:
The Trial Battery


Name                                                        Number of Items

COGNITIVE PAPER-AND-PENCIL TESTS

Reasoning Test (Induction - Figural Reasoning)                      30
Object Rotation Test (Spatial Visualization - Rotation)             90
Orientation Test (Spatial Orientation)                              24
Maze Test (Spatial Visualization - Scanning)                        24
Map Test (Spatial Orientation)                                      20
Assembling Objects Test (Spatial Visualization - Rotation)          32

COMPUTER-ADMINISTERED TESTS

Simple Reaction Time (Processing efficiency)                        15
Choice Reaction Time (Processing efficiency)                        30
Memory Test (Short-term memory)                                     36
Target Tracking Test 1 (Psychomotor precision)                      18
Perceptual Speed and Accuracy Test (Perceptual speed
  and accuracy)                                                     36
Target Tracking Test 2 (Two-hand coordination)                      18
Number Memory Test (Number operations)                              28
Cannon Shoot Test (Movement judgment)                               36
Identification Test (Perceptual speed and accuracy)                 36
Target Shoot Test (Psychomotor precision)                           30

NON-COGNITIVE PAPER-AND-PENCIL INVENTORIES

Assessment of Background and Life Experiences (ABLE)               209
    Adjustment
    Dependability
    Achievement
    Physical Condition
    Leadership
    Locus of Control
    Agreeableness/Likeability

Army Vocational Interest Career Examination (AVOICE)               176
    Realistic Interests
    Conventional Interests
    Social Interests
    Enterprising Interests
    Artistic Interests


Table IV.3

Summary of Criterion Measures Used in Batch A and Batch Z
Concurrent Validation Samples


•  Army-Wide Rating Scales (all obtained from both supervisors and
   peers).

   -  Ten behaviorally anchored rating scales (BARS) designed to
      measure factors of non-job-specific performance.
   -  Single scale rating of overall effectiveness.
   -  Single scale rating of NCO potential.
   -  Combat prediction scale containing 41 items.

•  Paper-and-Pencil Test of Training Achievement developed for each
   of the 19 MOS (139-210 items each).

•  Personnel File Information form developed to gather objective
   archival records data (awards and letters, rifle marksmanship
   scores, physical training scores, etc.).

■»Ui  I'U  l  ■i-IJi  1  iHiM.WJil*  4 


•  Job Sample (Hands-On) tests of MOS-specific task proficiency.

   -  Individual is tested on each of 15 major job tasks in an MOS.

•  Paper-and-pencil job knowledge tests designed to measure
   task-specific job knowledge.

   -  Individual is scored on 150 to 200 multiple-choice items
      representing 30 major job tasks. Ten to 15 of the tasks were
      also measured hands-on.

•  Rating scale measures of specific task performance on the 15
   tasks also measured with the knowledge tests. Most of the rated
   tasks were also included in the hands-on measures.

•  MOS-specific behaviorally anchored rating scales (BARS). From 6
   to 10 BARS were developed for each MOS to represent the major
   factors that constitute job-specific technical and task
   proficiency.

•  Army-Wide Rating Scales (all obtained from both supervisors and
   peers).

   -  Ratings of performance on 11 common tasks (e.g., basic
      first aid).
   -  Single scale rating on performance of specific job duties.


Auxiliary Measures Included in Criterion Battery

•  A Job History Questionnaire, which asks for information about
   frequency and recency of performance of the MOS-specific tasks.

•  Work Questionnaire - a 44-item questionnaire scored on 14
   dimensions descriptive of the job environment.

•  Measurement Method Rating obtained from all participants at the
   end of the final testing session.


Section  2 


CONCURRENT  VALIDATION:  SAMPLES  AND  PROCEDURES 


The original Project A Research Plan specified a Concurrent Validation
target sample size of 600-700 job incumbents for each of the 19 MOS, and a
tentative starting date of 1 May 1985, using procedures that had been tried
out and refined during the predictor and criterion field tests. The Research
Plan specified 13 data collection sites in the United States (CONUS) and two
in Europe (USAREUR). The number of sites was the maximum that could be
visited within the project's budget constraints, which dictated that sites
should be chosen to maximize the probability of obtaining the required sample
sizes.

The data collection actually began 10 June 1985 and was not yet concluded
by the end of FY85 (30 September 1985). The projected schedule, by site, is
shown in Figure IV.1. Although the starting date shifted slightly, it was
still within the permissible "window" that would maintain the project's
original schedule.

The data were collected by on-site teams made up of project staff. Each
square in Figure IV.1 represents 1 week of one team's time. For example,
during the week of 7 July seven teams were operating, one at each of seven
posts.


Fort Lewis
Fort Benning
Fort Riley
Fort Carson
Fort Hood
Fort Stewart
Fort Bragg


10  .lun  >  II  Jul 
17  Jun  •  M  Jul 

•  Jul  •  0  Aug 

•  Jul  ■  23  Aug 
I  Jul  «  27  Aug 
22 Jul - 30 Aug
1 Aug - 13 Sep


Fort Knox
Fort Sill
Fort Campbell
Fort Polk
Fort Bliss
Fort Ord
USAREUR


15 Aug - 13 Sep
3 Sep - 4 Oct
3 Sep - 14 Oct
30 Sep - 13 Nov
1 Oct - 11 Oct
7 Oct - 15 Nov
12 Jul - 8 Oct
12 Jul - 9 Aug
20 Sep - 18 Oct
Figure IV.1. Concurrent Validation schedule.

Since data collection was not concluded until FY86, a detailed description
of the Concurrent Validation procedure and results must wait for a future
report. However, the basic sampling plan, team training, and data collection
procedures are summarized below. An outline of the planned data analysis
steps is presented in the following section (i.e., Section 3 of Part IV).


Cross-Validation  Sample 

The general sampling plan was to use the Army's World-Wide Locator System
to identify, at the specified sites, all the first-term enlisted personnel
in Batch A and Batch Z MOS who entered the Army between 1 July 1983 and
31 July 1984. If possible, the individual's unit identification was also to
be obtained. The steps described below were then followed. The intent was
to be as representative as possible while preserving enough cases within
units to provide a "within rater" variance estimate for the supervisor and
peer ratings.


Sampling  Plan:  Concurrent  Validation 


A.  Preliminary activities:

    1.  Identify the subset of MOS (within the sample of 19) for which
        it would be possible to actually sample people within units at
        specific posts. That is, given the entry date "window" and
        given that only 50-75 percent of the people on any list of
        potential subjects could actually be found and tested, what
        MOS are large enough to permit sampling to actually occur?
        List them.

    2.  For each MOS in the subset of MOS for which sampling is
        possible, identify the smallest "unit" from which 6-10 people
        can be drawn. Ideally, we would like to sample 4-6 units from
        each post and 6-12 people from each unit. For the total
        concurrent sample this would provide enough units to average
        out or account for differential training effects and
        leadership climates, while still providing sufficient degrees
        of freedom for investigating within-group effects such as
        rater differences in performance appraisal.

    3.  For the four MOS in the Preliminary Battery sample, identify
        the members of the PB sample who are on each post.

B.  The ideal implementation would be to obtain the Alpha Roster list
    of the total population of people at each post who are in the 19
    MOS and who fit our "window". The lists would be sent to HumRRO
    where the following steps would be carried out:

    1.  For each MOS, randomize units and randomize names within
        units.


2.  Select  a  sample  of  units  at  random.  The  number  would  be  large 
enough  to  allow  for  some  units  being  truly  unobtainable  at  the 
time  of  testing. 

3.  Instruct  the  Point-of-Contact  (POC)  at  the  post  to  obtain  the 
required  number  of  people  by  starting  at  the  top  of  the  list 
and  working  down  (as  in  the  Batch  A  field  test)  within  each  of 
the  designated  units.  If  an  entire  unit  is  unavailable,  go  on 
to  the  next  one  on  the  list. 

    4.  In those MOS for which unit sampling is not possible, create a
        randomized list of everyone on the post who fits the window.
        Instruct the POC to obtain the required number by going down
        the list from top to bottom (as in the Batch A field tests).

C.  If it is not possible to bring the Alpha Roster to HumRRO, provide
    project staff at the post to assist the POC in carrying out the
    above steps.

    1.  If it is not possible to randomize names at the post,
        first use the World-Wide Locator to obtain a randomized list,
        carry the list to the post, and use it to sample names from
        units drawn from a randomized list of units. If there are
        only six to eight units on the post, then no sampling of units
        is possible. Use them all.

D.  If it is not possible for project personnel to visit the post, then
    provide the randomized World-Wide Locator list to the POC. Ask him
    or her to follow the sampling plan described above; supply written
    and telephone assistance. That is, the POC would identify a sample
    of units (for those MOS for which this is possible), match the unit
    roster with the randomized World-Wide Locator list, and proceed
    down each unit until the required number of people was obtained.
    If the POCs can generate their own randomized list from the Alpha
    Roster, so much the better. The World-Wide Locator serves only to
    specify an a priori randomized list for the POC.

E.  If none of the above options is possible, then present the POC with
    the sampling plan and instruct him or her to obtain the required
    number of people in the most representative way possible (the Batch
    B procedure).
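
The core of the plan (randomize units, randomize names within units, then
work down the list and skip units that cannot be reached) can be sketched
in a few lines. This is only an illustration under simplifying assumptions:
the function names, the roster layout, and the availability rule are all
hypothetical and do not describe the procedure actually run at HumRRO.

```python
import random

def randomize_roster(roster, seed=None):
    """Shuffle the unit order and the name order within each unit.

    roster maps a unit label to the list of eligible names in that unit.
    """
    rng = random.Random(seed)
    units = sorted(roster)      # fixed starting order so a seed is repeatable
    rng.shuffle(units)
    ordered = []
    for unit in units:
        names = sorted(roster[unit])
        rng.shuffle(names)
        ordered.append((unit, names))
    return ordered

def work_down_list(ordered, n_units, n_per_unit, is_available):
    """Take units in randomized order and the first available names in each."""
    sample = {}
    for unit, names in ordered:
        if len(sample) == n_units:
            break
        found = [name for name in names if is_available(name)][:n_per_unit]
        if found:               # a wholly unavailable unit is skipped
            sample[unit] = found
    return sample

# Hypothetical post with five units of ten soldiers; unit "C" is unavailable.
roster = {u: [f"{u}-{i:02d}" for i in range(10)] for u in "ABCDE"}
ordered = randomize_roster(roster, seed=1985)
sample = work_down_list(ordered, n_units=4, n_per_unit=6,
                        is_available=lambda name: not name.startswith("C"))
```

Fixing the randomized order before anyone at the post sees the list is what
keeps the choice of soldiers out of local hands, which appears to be the
purpose of the a priori list mentioned in step D.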


Actual  Samples  Obtained 

The final sample sizes are shown by post and by MOS in Figure IV.2. Note
that it was not always possible in all MOS to find as many as 600
incumbents with the appropriate accession dates at the 15 sites. Some MOS
simply are not that big.


Data Collection Team Composition and Training

Team Composition

Each data collection team was composed of a Test Site Manager (TSM) and six
or seven project staff members who were responsible for test and rating
scale administration. The teams were made up of a combination of regular
project staff and individuals specifically recruited for the data collection
effort (e.g., graduate students). The Test Site Manager was an "old hand"
who had participated extensively in the field tests. This team was assisted
by eight NCO scorers (for the hands-on tests), one company-grade officer POC,
and up to five NCO support personnel, all recruited from the post.


Team  Training 

The project data collection teams were given 3 days of training at a
central location (Alexandria, VA). During this period, Project A was
explained in detail, including its operational and scientific objectives.
After the logistics of how the team would operate (transportation, meals,
etc.) were discussed, the procedures for data entry from the field to the
computer file were explained in some detail. Emphasis was placed on reducing
data entry errors at the outset by correct recording of responses and correct
identification of answer sheets and diskettes.

Next, each predictor and criterion measure was examined and explained.
The trainees took each predictor test, worked through samples of the
knowledge tests, and role played the part of a rater. Considerable time was
spent on the nature of the rating scales, rating errors, rater training, and
the procedures to be used for administering the ratings. All administrative
manuals, which had been prepared in advance, were studied and pilot tested,
role playing exercises were conducted, and hands-on instruction for
maintenance of the computerized test equipment was given.

The intent was that by the end of the 3-day session each team member
would (a) be thoroughly familiar with all predictor tests and performance
measures, (b) understand the goals of the data collection, (c) have had an
opportunity to practice administering the instruments and to receive
feedback, and (d) be committed to making the data collection as error-free as
possible.


Hands-On  Scorer  Training 

As noted above, eight NCO scorers were required for Hands-On test
scoring. They were recruited and trained using procedures very similar to
those used at each post in the criterion field tests. Training took place
over 1 full day and consisted of (a) a thorough briefing on Project A, (b) an
opportunity to take the tests themselves, (c) a check of the specified
equipment, and (d) multiple practice trials in scoring each task, with
feedback from the project staff. The intent was to develop high agreement for
the precise responses that would be scored as GO or NO-GO on each step.


Data Collection Procedure

The Concurrent Validation administration schedule for a typical site
(Fort Stewart, Georgia) is shown in Figure IV.3. The first day (22 July 1985)
was devoted to equipment and classroom set-up, general orientation to the
data collection environment, and a training and orientation session for the
post POC and the NCO support personnel.


[Figure IV.4. Sample schedule for Concurrent Validation administration
(4 pages): Fort Stewart, GA, Concurrent Validation, 22 July - 30 August 1985.
The schedule grid lists, for each testing day, the MOS and number of soldiers
scheduled and the AM and PM session assignments for each group of Batch A
soldiers (tested for 2 days) and Batch Z soldiers (tested for 1 day).
22 July was reserved for training/orientation for data collection, and
30 August was the make-up day.]

Legend:

R - Rating Scales

   R1  Batch A (Army-Wide, MOS BARS, Job History)

   R2  Batch A (Combat Prediction, Work Questionnaire, Personnel File
       Information)

   R   Batch Z (Army-Wide, Overall Performance, Common Tasks, Combat
       Prediction, Work Questionnaire, Personnel File Information)

K - Knowledge Tests

   K3  Training Achievement Tests
   K5  MOS Task-Based Tests

P - Predictor Tests

HO - Hands-On Tests

In addition, at the end of their final session, all soldiers filled out the
Measurement Method Rating (MMR).




Section  3 


CONCURRENT  VALIDATION:  ANALYSIS  PLAN 


The analysis plan for the Concurrent Validation data is outlined in this
section. The overall goal is to move systematically from the raw data, which
consist of thousands of elements of information on each individual, to
estimates of selection validity, differential validity, and selection/
classification utility.


General  Steps 

The  overall  analysis  plan  consists  of  the  following  steps: 

1. Prepare and edit individual data files.

2. Determine basic scores for the predictor variables.

3. Determine basic scores for the criterion variables.

4. Describe the latent structure of the predictor and criterion
covariance matrices.

5. Determine how well each predictor predicts each criterion variable
(for each MOS).

6. Determine how well predictive relationships generalize across
criterion constructs and across MOS.

7. Examine subgroup differences in the predictive relationships and
their generalizations.

8. Evaluate alternative sets of predictors in terms of maximizing
classification efficiency while minimizing any predictive bias.


Data  Preparation 


For initial processing, the data from the field are being divided into
the following groups:

Predictor Measures -

•  Computer Tests - diskettes are sent to project staff for
processing.

•  Paper-and-Pencil Tests - booklets sent to vendor for scanning.

Criterion Measures -

•  Hands-on Measures - score sheets sent to project staff for
keypunching.

•  Ratings, Knowledge Tests, Background, Job History, Work
Questionnaire, Method Measurement, Personnel File Information - sent
together to vendor for scanning.

The Roster Control File will be merged with the most recent Enlisted
Master Files extracts for the FY83/84 cohort and with Applicant/Accession
files. Any unmatched cases will be checked for incorrect identifiers.


Score  Generation 


While the data are still separated into the different types, initial
score variables will be generated. For the paper-and-pencil tests, we will
generate number correct and number omitted scores, as was done in the field
tests. Subsequent revisions to these scores will be implemented to reflect
any scoring changes suggested during item analyses and as part of subsequent
outlier analyses.

For the non-cognitive predictor tests, scale scores will be generated in
the same manner as was done for the field test. A missing data screen will
be implemented identifying any score where more than 10% of the component
items were omitted.

For  the  computer-administered  tests,  response  time,  error,  and  other 
derivative  scores  will  be  generated  as  per  the  guidance  from  the  field  test 
results. 

For the hands-on measures, the percentage of steps passed will be
computed for each task. We will also compute the number of steps not scored
for each task. If more than 10% of the steps were not scored for a given
soldier, the task score will be identified as missing; otherwise, scores for
the missing steps will be imputed as described under Missing Values below.
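
The 10% screening rule for hands-on task scores can be sketched as follows. This is only an illustration: the function name, the GO/NO-GO/None encoding, and the choice to leave unscored steps out of the percentage (rather than imputing them, as the plan specifies) are assumptions made for the example.

```python
from typing import Optional, Sequence

MAX_UNSCORED_FRACTION = 0.10  # threshold stated in the text


def task_score(steps: Sequence[Optional[bool]]) -> Optional[float]:
    """Percentage of steps passed for one soldier on one task.

    Each step is True (GO), False (NO-GO), or None (not scored).
    Returns None when more than 10% of the steps were not scored,
    i.e., when the task score is to be identified as missing.
    """
    if not steps:
        return None
    unscored = sum(1 for s in steps if s is None)
    if unscored / len(steps) > MAX_UNSCORED_FRACTION:
        return None
    scored = [s for s in steps if s is not None]
    # Assumption: unscored steps are simply dropped here; the analysis
    # plan would impute them before computing the percentage.
    return 100.0 * sum(scored) / len(scored)
```

The same pattern would apply to the missing data screen for non-cognitive scale scores, with omitted items in place of unscored steps.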

For the rating data, adjusted ratee means will be computed for each
rating scale as was done in the field test. A separate file of the
individual ratings will be maintained for reliability and other analyses.

Initial  summary  measures  will  be  defined  for  each  data  collection  method 
(hands-on,  rating,  and  knowledge  test)  on  the  criterion  side.  These  measures 
will  be  means  of  task  scores  or  rating  scales  and  will  serve  as  performance 
factor  scores  pending  more  precise  definitions  of  performance  composites. 


Outlier  Analysis 

Outlier  analysis  will  be  conducted  for  each  of  the  predictor  and  cri¬ 
terion  score  variables.  For  test  data,  distribution  of  the  number  correct 
and  number  omitted  will  be  examined.  Residual  errors  in  predicting  these 
scores  from  other  variables  will  also  be  examined.  Any  residuals  larger  than 
three  standard  errors  will  be  questioned. 
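
A minimal sketch of the residual screen follows, assuming a single predictor variable and taking "standard error" to mean the standard error of estimate from the regression. Both are simplifying assumptions; the plan does not say which variables enter the prediction, and the function name is hypothetical.

```python
import math
from typing import List, Sequence


def flag_residual_outliers(
    x: Sequence[float], y: Sequence[float], k: float = 3.0
) -> List[int]:
    """Return indices of cases whose residual from the simple regression
    of y on x exceeds k standard errors of estimate in absolute value."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Standard error of estimate with n - 2 degrees of freedom.
    see = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return [i for i, r in enumerate(residuals) if abs(r) > k * see]
```

Flagged cases are candidates for questioning, not automatic deletions; note also that a large outlier inflates the standard error itself, so the screen is conservative in small samples.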


For the non-cognitive predictor measures, additional screening will be
performed. The ABLE includes built-in validity scales. The cutoffs used
with the field test data will be reviewed with project staff. The default
will be to use the same cutoffs and identify all of the ABLE scale values as
missing for any soldier exceeding the cutoff.

For the rating data, we will conduct outlier analysis and individual
rater adjustments in the same way that we did for the field test data. This
procedure involves looking at the correlation of each rater's ratings with
the mean of all other raters' ratings of the same soldiers and also looking
at the mean signed error. Outliers are identified in terms of these
statistics and also in terms of measures of halo (lack of variability across
dimensions).
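
The two rater statistics named above can be sketched as follows. The data layout (rater -> soldier -> rating on a single scale) and the function names are assumptions for illustration; the halo measure and the cutoffs used to declare a rater an outlier are not covered.

```python
import math
from typing import Dict, List, Tuple

Ratings = Dict[str, Dict[str, float]]  # rater -> soldier -> rating


def pearson(a: List[float], b: List[float]) -> float:
    """Pearson correlation; NaN when either variable has no variance."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    if saa == 0 or sbb == 0:
        return float("nan")
    return sab / math.sqrt(saa * sbb)


def rater_stats(ratings: Ratings, rater: str) -> Tuple[float, float]:
    """Return (correlation with the mean of the other raters,
    mean signed error) for one rater, over soldiers that at least
    one other rater also rated."""
    own: List[float] = []
    others: List[float] = []
    for soldier, value in ratings[rater].items():
        peer = [r[soldier] for name, r in ratings.items()
                if name != rater and soldier in r]
        if peer:
            own.append(value)
            others.append(sum(peer) / len(peer))
    if not own:
        return float("nan"), float("nan")
    corr = pearson(own, others)
    mean_signed_error = sum(o - p for o, p in zip(own, others)) / len(own)
    return corr, mean_signed_error
```

A rater with a low or negative correlation, or a large positive or negative mean signed error, would be a candidate for review.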


Missing  Values 


Because extensive multivariate analyses requiring complete data will be
performed, the treatment of missing values is an important concern, much more
so than was the case with the field test data. Prior efforts have amounted
to substituting either examinee means or variable means for missing values.
We plan to use PROC IMPUTE to derive proxy values for missing scale scores
(and for missing step scores in the hands-on analyses).

This procedure essentially substitutes a value observed for a respondent
who was very similar to the examinee with the missing variable. This
procedure has been shown to be significantly better than straight regression
procedures (e.g., BMDPAM) in reproducing correlation and variance estimates,
as the regression approaches tend to underestimate variances and to
spuriously inflate correlations.
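
The general idea of borrowing a value from a similar respondent can be illustrated with a simple nearest-donor substitution. This sketch is not PROC IMPUTE and makes no claim about how that program selects donors; the distance measure (mean squared difference over the variables both cases have observed) and the function name are assumptions, and scores are assumed to be on comparable scales.

```python
from typing import List, Optional

Record = List[Optional[float]]


def impute_from_nearest_donor(records: List[Record]) -> List[Record]:
    """Fill each missing value with the value observed for the most
    similar case that has that variable present. The input is not
    modified; values with no eligible donor stay None."""
    completed = [list(r) for r in records]
    for i, rec in enumerate(records):
        for j, value in enumerate(rec):
            if value is not None:
                continue
            best_donor, best_dist = None, None
            for k, donor in enumerate(records):
                if k == i or donor[j] is None:
                    continue
                shared = [(a, b) for a, b in zip(rec, donor)
                          if a is not None and b is not None]
                if not shared:
                    continue
                dist = sum((a - b) ** 2 for a, b in shared) / len(shared)
                if best_dist is None or dist < best_dist:
                    best_donor, best_dist = donor, dist
            if best_donor is not None:
                completed[i][j] = best_donor[j]
    return completed
```

Because the substituted value is an actually observed score rather than a regression prediction, this kind of procedure does not shrink the imputed values toward the regression line, which is the property the text contrasts with regression-based imputation.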

Data completion flags will be generated for each battery. For each set,
the flag will indicate whether the data are complete, partially complete and
partially missing or imputed, or entirely missing or imputed.
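
One possible encoding of the three-level completion flag is sketched below. The status labels and flag strings are hypothetical; the plan specifies only the three categories.

```python
from typing import Sequence


def completion_flag(statuses: Sequence[str]) -> str:
    """Summarize one battery for one soldier.

    Each element of `statuses` describes one score in the battery and is
    assumed to be "observed", "missing", or "imputed".
    """
    if not statuses:
        raise ValueError("a battery must contain at least one score")
    observed = sum(1 for s in statuses if s == "observed")
    if observed == len(statuses):
        return "complete"
    if observed == 0:
        return "entirely missing or imputed"
    return "partially missing or imputed"
```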


Predictor  Score  Analyses 

After data preparation, basic item analyses, and the initial score
generation, the principal objectives for the predictor analyses are to
generate the basic summary scores that will enter the initial prediction
equation in each MOS, examine the latent structure of those scores, and
determine MOS and subgroup differences. The basic steps are as follows:

1. Using the initial scores, item/scale score analyses will be
conducted as the final opportunity to (a) identify faulty items,
(b) revise the scale by eliminating some items, and (c) arrive at
the final array of predictor scores that will be entered in the
predictor intercorrelation matrices.

2. Scale reliabilities and descriptive statistics will then be
computed.

3. The latent structure of the predictor space will be examined via
confirmatory factor analysis. Factor structure matrices based on
the field test data were estimated for each of the predictor
domains. Once the scale scores have been created and their
reliabilities estimated, we will proceed to confirmatory factor
analysis. We anticipate a hierarchy of predictor factors. At the
most detailed level the factors will include:

Cognitive Factors - Verbal (from ASVAB), Quantitative, Technical
Knowledge, Speed, Visualization

Psychomotor Factors - Reaction Time, One-Hand Tracking, Two-Hand
Tracking, etc.

Non-cognitive Factors - Surgency, etc.

LISREL runs will be examined to determine the goodness of fit of
hypothesized factor structures. In addition to testing an a priori
model, we will look for potential simplifications (loadings not
significantly different from zero) and also consider potential
improvements in fit by adding additional dimensions or additional
loadings on the existing dimensions. In considering modifications
to the original model, interpretability will be given somewhat more
weight than empirical statistics.

4. Predictor factor (construct) scores will be estimated with a least
squares procedure. The chief alternatives would be weighted sums
(means) on the one hand and a more complex maximum likelihood or
multistage least squares approach to factor score estimation on the
other.

5. MOS differences in the predictor constructs will be examined. The
purpose of these analyses will be to see what kind of applicants go
into (and remain in) the different MOS. Particular attention will
be focused on the interest and temperament measures. This work will
be necessarily exploratory since only concurrent data will be
available for most measures. Special analyses with the Preliminary
Battery MOS will attempt to determine whether the same construct
differences existed prior to training.

6. Subgroup analyses will be run separately for each of the MOS with at
least 100 in each of the different race or gender groups. At this
stage, we are just looking for mean differences in the predictor
scores, not for differences in regression slopes.


Criterion  Score  Analyses 

After data preparation has been completed, the objectives for the
criterion analyses are to identify an array of basic criterion variables
(i.e., scores), investigate the latent structure of those variables, and
determine the criterion construct scores. The following steps will be taken:

1. A final set of items by a priori scale analyses will be used to
identify faulty or misplaced items. At this point the number of
criterion variables will still be too large to enter into an
intercorrelation matrix. For example, the job knowledge test will
still contain 30 task scores, and the number of individual rating
scales will still be quite large.

2. A more manageable set of basic criterion scores will be obtained by
factoring/clustering rating scales, hands-on test steps, and
knowledge test items. In general, exploratory factor analysis will be
used to reduce the individual rating scales to clusters of scales
that will be averaged. For example, analyses of the field test data
suggested that it might be reasonable to group the 11 Army-wide BARS
scales into two or three clusters. For the hands-on and knowledge
tests, items will be clustered via expert judgment sorts.

3. After step 2 yields a basic array of criterion scores, an
intercorrelation matrix will be calculated for each MOS. Exploratory
factor analyses will be used to generate hypotheses about the latent
structure of the criterion space.

4. The "theories" about the criterion space generated in step 3 will be
subjected to confirmatory analyses in an attempt to make a
reasonable choice about the best-fitting model for the total domain of
job performance for each MOS.

5. After the variables that comprise each criterion factor (construct)
are identified, factor scores will be generated in a similar fashion
as for the predictors.


Predictor/Criterion Interrelationships


After the above analyses have been carried out, the basic variables and
the best-fitting model for both the predictors and the performance measures
will have been identified. They can be compared to the "best guess" that was
offered at the conclusion of the field tests. They also provide the
variables to be used for establishing the selection/classification validity
of the new predictor battery and for determining differential validity across
criterion constructs, across jobs, and across subgroups. The basic picture
and the analysis steps are summarized in Figures IV.5 and IV.6.


The validity analyses will begin by regressing the predictor battery
against each criterion factor in each MOS. Within MOS we will then proceed
to determine how much information is lost by reducing the number of
prediction equations from K, the number of criterion factors, to just one
equation. The predictive accuracy of the composite equation can then be
compared to an equation developed for the composite criterion measure for
each MOS. The method for obtaining the criterion composite score is described
below.


The generalizability across MOS of the prediction equation for each
criterion factor will be determined by using the predictor weights developed
for one MOS to compute predicted scores for the same criterion factor (e.g.,
peer leadership and support) in each of the other MOS. The loss in
predictive information as the number of equations is reduced from 19 to 1
will be determined.
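
The cross-MOS check described above can be sketched as follows: weights are estimated in one MOS and then applied, unchanged, to the predictor scores in another MOS, and the resulting validity is compared with the validity of the second MOS's own equation. The sketch is restricted to two predictors so that the least squares solution can be written out directly; the function names are hypothetical and the actual analyses would use the full predictor battery.

```python
import math
from typing import List, Sequence, Tuple


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return sab / math.sqrt(saa * sbb)


def fit_two_predictor_ols(
    x1: Sequence[float], x2: Sequence[float], y: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (intercept, b1, b2) from the centered 2x2 normal equations."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    s11 = sum((a - m1) ** 2 for a in x1)
    s22 = sum((b - m2) ** 2 for b in x2)
    s12 = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    s1y = sum((a - m1) * (c - my) for a, c in zip(x1, y))
    s2y = sum((b - m2) * (c - my) for b, c in zip(x2, y))
    det = s11 * s22 - s12 ** 2  # zero if the predictors are collinear
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    return my - b1 * m1 - b2 * m2, b1, b2


def validity(
    weights: Tuple[float, float, float],
    x1: Sequence[float], x2: Sequence[float], y: Sequence[float]
) -> float:
    """Correlation between the criterion and scores predicted with the
    given weights (which may come from a different MOS)."""
    a, b1, b2 = weights
    predicted: List[float] = [a + b1 * u + b2 * v for u, v in zip(x1, x2)]
    return pearson(predicted, y)
```

The difference between `validity(own_weights, ...)` and `validity(borrowed_weights, ...)` for the same MOS is one simple index of the predictive information lost by sharing an equation across MOS.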


Figure IV.5. Predictor variable/criterion factor matrix.


A. Within MOS.

1. Compute "best" prediction equation for each criterion factor.

2. Compute best prediction equation for overall composite score.

3. Determine loss of predictability as number of equations is reduced
from (1) to (2).

4. Determine incremental validity of Project A measures (over ASVAB).

B. Between MOS.

1. Determine generalizability of each performance factor's prediction
equation across MOS.

2. Determine generalizability of composite prediction equations across
MOS.

3. Determine generalizability of incremental validity across MOS.

C. Differential Prediction Across Gender/Minority Groups (within selected
MOS).

1. Differential prediction for criterion-specific equations.

2. Differential prediction for criterion composite equation.

D. Estimation of Classification Efficiency.

1. Brogden/Horst approximation.

2. Project A approximation.


Figure IV.6. Analysis plan for predictor variable x criterion factor matrix.

For each major prediction equation the degree of differential prediction
across racial and gender groups will be determined for MOS in which sample
sizes are sufficient. For equations that produce significant differential
prediction, the predictors that seem to be the source will be isolated and
the validity of the equation recomputed with those variables omitted.

Also, for each major prediction equation the incremental validity of the
Trial Battery compared to ASVAB will be computed. For the first analyses the
degree of incremental validity will be expressed in terms of variance
accounted for (ΔR²). Additional indexes will be presented in the form of
expectancy charts where specific base rates on the criterion are used to
compare the percentage of correct predictions using ASVAB and ASVAB plus the
Trial Battery.
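
As an illustration of incremental validity expressed as a gain in variance accounted for, the sketch below treats the simplest case: one ASVAB composite and one Trial Battery score. With two predictors the squared multiple correlation can be computed directly from the three pairwise correlations. The function and variable names are hypothetical, and the actual analyses would involve many more predictors.

```python
import math
from typing import Sequence, Tuple


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    saa = sum((x - ma) ** 2 for x in a)
    sbb = sum((y - mb) ** 2 for y in b)
    sab = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    return sab / math.sqrt(saa * sbb)


def incremental_r2(
    asvab: Sequence[float], trial: Sequence[float], criterion: Sequence[float]
) -> Tuple[float, float, float]:
    """Return (R2 for ASVAB alone, R2 for ASVAB plus the Trial Battery
    score, and the difference between them)."""
    r_ay = pearson(asvab, criterion)
    r_ty = pearson(trial, criterion)
    r_at = pearson(asvab, trial)
    if abs(r_at) >= 1.0:
        raise ValueError("predictors are perfectly collinear")
    r2_asvab = r_ay ** 2
    # Two-predictor squared multiple correlation from pairwise correlations.
    r2_full = (r_ay ** 2 + r_ty ** 2 - 2 * r_ay * r_ty * r_at) / (1 - r_at ** 2)
    return r2_asvab, r2_full, r2_full - r2_asvab
```

In sample data the increment is never negative, so in practice it would need to be judged against shrinkage or cross-validation rather than taken at face value; that step is not shown here.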


Criterion  Composite  Scores 

At the operational level, a selection and classification system makes
two fundamental decisions about individuals: to select or not select into the
organization, and, if selected, to choose the job (MOS) to which the
individual will be assigned. For the Army these have the form of two
sequential single-choice decisions. Developing the appropriate decision
rules and estimating the overall validity, or accuracy, with which each
decision is made requires a single algorithm. To compute a single validity
estimate, the individual pieces of validity information must be aggregated.
One straightforward way of doing this is to compute the decision rules and
validity estimates using a single aggregate or composite measure of job
performance.

For each MOS an overall criterion composite score will be computed for
each individual in the following way. The latent structure, or criterion
model, which receives the strongest support in the confirmatory analysis
will be used to designate the specific measures that will be aggregated to
obtain factor or construct scores. Since the observed scores that enter the
confirmatory analysis will already have been through a considerable amount of
item and scale aggregation, the most straightforward scoring procedure would
be to use an unweighted sum of standard scores to obtain a construct score.

Once the latent structures, or criterion constructs, for each MOS have
been identified and defined, an overall composite will be obtained by expert
weighting. To accomplish this, a series of workshops will be held with NCOs
and company officers from each MOS, and the NCOs and officers will act as
SMEs to scale the relative importance of each criterion construct for overall
performance. Once the construct weights are obtained, they can be used to
generate a weighted composite score.
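
The two aggregation steps described above (an unweighted sum of standard scores within a construct, then an importance-weighted combination of constructs) might look like the following. All names are hypothetical, the weights are normalized to sum to 1 as a convenience, and whether construct scores are re-standardized before weighting is left open here.

```python
import statistics
from typing import Dict, List, Sequence


def standardize(scores: Sequence[float]) -> List[float]:
    """Convert raw scores to standard (z) scores within the sample.
    Raises ZeroDivisionError if the scores have no variance."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)
    return [(s - mean) / sd for s in scores]


def construct_scores(measures: Sequence[Sequence[float]]) -> List[float]:
    """Unweighted sum of standard scores across the measures assigned
    to one criterion construct (one list of scores per measure)."""
    z = [standardize(m) for m in measures]
    return [sum(values) for values in zip(*z)]


def weighted_composite(
    constructs: Dict[str, Sequence[float]], weights: Dict[str, float]
) -> List[float]:
    """Combine construct scores using SME importance weights,
    normalized so that the weights sum to 1."""
    total = sum(weights[name] for name in constructs)
    n = len(next(iter(constructs.values())))
    return [
        sum(weights[name] / total * constructs[name][i] for name in constructs)
        for i in range(n)
    ]
```

For example, with construct weights of 3 and 1, a soldier's composite is 0.75 times the first construct score plus 0.25 times the second.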

During FY86, several different scaling methods for obtaining importance
weights will be tried out in three exploratory workshops. The method judged
to be the most feasible and most informative will be used in the final
criterion weighting workshops, which will begin in the summer of 1986. The
judgment methods being explored are ratio estimation, paired comparisons, and
conjoint measurement. Methods will be compared in terms of their ease of
use, perceived validity, acceptability to the Army, and interjudge agreement.

After the importance weights for the performance factors are obtained,
the composite scores for each MOS can be computed and used to derive a single
prediction equation for each MOS. An additional issue is the effect of
scenario (e.g., European conflict vs. peacetime) on the construct weights.
Scenario effects will also be explored in the preliminary workshops.


A full determination of classification validity or efficiency requires a
common performance or criterion metric across all jobs (MOS). If the
analysis goal is to rank order the validity, or efficiency, of alternative
classification systems, then the precise meaning of the common metric is not
crucial. However, if an organization wants to evaluate the cost vs. benefit
of a new system, then some sort of utility metric is necessary.


In the private sector, great deference is paid to the dollar metric as
the appropriate utility scale. If only the costs and benefits of personnel
programs could be portrayed in dollar terms, then decisions among
alternatives would be straightforward, or so it is hoped. Attempts to deal
with the dollar metric in personnel research are well documented (e.g.,
Boudreau, 1983; Cascio, 1982; Hakel, 1986; Hunter & Schmidt, 1982) and need
not be repeated here.


The general procedure that Project A has used to approach this problem
is summarized in Sadacca and Campbell (1985). We began by exploring the
problem in eight workshops with 8-12 field grade Army officers in each
workshop. Each workshop incorporated both a general discussion of the
"utility of performance" issue and a tryout of one or more potential scaling
methods with which to scale the utility of MOS x performance level
combinations (e.g., an MOS 11B Infantryman who performs at the 70th
percentile). Five performance levels were defined as simply the 10th
percentile, 30th percentile, 50th percentile, 70th percentile, and 90th
percentile performer. The findings and conclusions from this first series of
workshops are reported in Sadacca and Campbell. The more important
conclusions were as follows:


1. For an organization such as the Army, expressing the utility of
performance in dollars makes little conceptual sense. The Army is
not in business to sell a product or service. Its job is to be in a
high state of readiness so as to be able to respond to external
threats of an unpredictable nature.

2. When using any one of several techniques (e.g., ratio estimation,
paired comparisons), Army officers can provide reliable and
consistent utility judgments. There does seem to be a commonly held
value system about the relative utility of specific MOS x
performance level combinations.

3. When participants were asked to describe the performance behaviors
of individuals at the different performance levels, the model
descriptions were very similar to the behavioral anchors of the BARS
scales described in Part III. Again, at some general level, there
does seem to be more common understanding of what performers at
different levels are like.


During FY85 and on into FY86, an additional series of five utility
scaling workshops was conducted. This second series examined specific
scaling issues and evaluated alternative scaling methods. Workshop
participants were asked to judge the difficulty of using each method and to
state their perception of its validity. Methods are being compared in terms
of their time demands on the judges, the amount of information they provide,
and the degree of interjudge agreement.

At the conclusion of FY85, the following utility scaling methods were
still in the process of being compared and evaluated.

•  Card sort - A sample of MOS x performance level combinations was
sorted into 7-10 piles such that there were equal-appearing
intervals between piles on a scale defined as the priority for
filling force strength requirements. A variant of this procedure
was to designate one of the piles as having zero utility for the
Army. Cards sorted into piles below that point represented
combinations with negative utility.

•  Scaling against a standard - This was a ratio estimation technique
in which a particular MOS x performance level combination was
assigned a utility of 100 points (e.g., a 90th percentile 11B) and
the remaining combinations were assigned points proportionately.

•  Paired comparisons (100 point distribution) - The relative utility
of MOS x performance level combinations presented in pairs was
judged by dividing 100 points between each pair such that the
difference represented the difference in relative potential utility
to the Army.

•  Paired comparisons (Equivalent Manning Levels) - For each MOS x
performance level combination, the judges were asked to estimate
the number of individuals of one combination that it would take to
equal some given number of the second combination (e.g., X number
of 11Bs at the 50th percentile would be equivalent to Y number of
71Ls at the 30th percentile). The MOS x performance level
combination pairs were judged independently.
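
To make the arithmetic of the equivalent manning judgments concrete: if X soldiers of one combination are judged equal to Y soldiers of another, the implied per-person utility ratio of the first to the second is Y/X. The sketch below computes that ratio, pools several judges with a geometric mean, and expresses the result in points against a 100-point standard. The geometric-mean pooling and all function names are assumptions for illustration, not the project's procedure.

```python
import math
from typing import Sequence, Tuple


def utility_ratio(x_first: float, y_second: float) -> float:
    """Per-person utility of the first combination relative to the
    second, given that x_first people of the first combination are
    judged equivalent to y_second people of the second."""
    return y_second / x_first


def pooled_ratio(judgments: Sequence[Tuple[float, float]]) -> float:
    """Pool several judges' (X, Y) equivalence judgments using the
    geometric mean of the implied ratios (an assumed pooling rule)."""
    logs = [math.log(utility_ratio(x, y)) for x, y in judgments]
    return math.exp(sum(logs) / len(logs))


def points_against_standard(
    ratio_to_standard: float, standard_points: float = 100.0
) -> float:
    """Express a combination's utility in points, where the standard
    combination is assigned `standard_points`."""
    return standard_points * ratio_to_standard
```

For instance, a judgment that 10 soldiers of one combination equal 5 of the standard combination implies a ratio of 0.5, or 50 points when the standard is set at 100.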

The exploratory workshops will be concluded during FY86. After that the
evaluation information will be analyzed and the two most promising methods
will be used to do the actual scaling of MOS performance-level utility during
FY86 and FY87. The next steps will be a full-scale proponent review of the
utility scaling results and a series of Monte Carlo studies to determine the
effects of different ranges of utility scale values on personnel assignments.

Three crucial issues must be resolved during FY86 and FY87:

•  First, the precise nature of the desired metric must be
incorporated in the scaling directions. For example, if
ratio estimation against a standard is used, then the
metric could be in terms of gains or losses of standard
equivalents (e.g., "The gain from using the system is
equivalent to having 3,000 more than 50th percentile 11Bs
in the force.").

•  Second, the issue of average vs. marginal utility must be
addressed. That is, does the utility of a specific
MOS/performance level combination change as more and more
personnel are added to the enlisted force?

Since the answer seems certain to be yes, an appropriate
scenario must be developed that will allow scale values to be
determined in a relatively straightforward manner.


EPILOGUE 


This volume has presented the first 3 years of work on the Army
Selection and Classification Project. At this point a full array of
selection/classification tests and new measures of training and job
performance have been developed. Also, all the newly developed measures have
been administered to a sample of nearly 10,000 job incumbents in a Concurrent
Validation design. This may very well be the richest single data set ever
generated in personnel psychology, and more is yet to come. What is perhaps
even more startling is that the project reached this point on schedule and
with the original research plan intact. It has done what it set out to do
with no compromises in the original objectives, so far.

It is also true that the work to date has been largely of a
developmental nature. A great deal of time and energy was poured into the
painstaking development of multiple instruments for multiple jobs and into
the planning and execution of the data collection procedures. All this has
been done under virtually continual evaluation by several review bodies.
However, it is now time for the fun part. It is the time to analyze the
data, to determine what more we can learn about performance and its
antecedents, and to plan for the most significant data collection of all, the
longitudinal validation. To falter now would be a disaster, both for the
operational needs of the Army and for the benefit of the discipline.